Week 1

The R Environment

Running RStudio (and R) initiates an interactive session.

  • A session is all interaction between you and R.
  • Interaction with R consists of you instructing R to do something (e.g. make a graph, run an analysis) and R processing your commands to produce some type of output.

The RStudio environment has several windows that help with environment and output management. There are four basic pods or panes.

  • Console: executing commands (R Markdown log lives here)
  • Files et al: file browser, plots and packages
  • Document: scripts for analysis and R Markdown files
  • Environment/History: contents of the workspace, command history

The R Console is just a command line.

  • Command prompt is given by a >
  • Multiple commands separated by a ;
  • Output appears beneath the command line
  • Recall past commands with the up/down arrow keys
  • Clear out the command window using menu or keyboard shortcuts
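As a minimal sketch of the console behavior above (the variable name is just for illustration), a single line can hold multiple commands separated by semicolons, with each result printed beneath the prompt:

```r
# two commands on one line, separated by a semicolon
x = 1 + 2; sqrt(16);
```
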

R Scripts

R scripts (opened in the editor) are text files that contain multiple commands, which can be executed one at a time or in chunks.

  • They allow you to easily reuse and build on previous code
  • They save you lots of time

There are two ways to create a new, blank R script.

  • File Menu > New File > R Script
  • The add new file tab on the top-left

Include header information at the top of your script (name, date, description, etc). You can also place any directory changes and library calls. Save your script to your desired directory.

R Directories

Having the ability to move between various locations on your computer (or network) is critical to data management and analysis.

  • Changing directories can make it easier to read data and manage output (eg graphs).
  • When R opens, it will usually use a default directory (typically a documents-type folder).
  • The default directory may be different on different machines.

The below code determines the current working directory.

getwd();

There are two ways to change/set your working directory.

  • Browse: in the Files tab, select the ellipses tab on the far right

    • Browse for the directory you want

    • Select the More tool, and then Set as Working Directory

  • Function call: use the function setwd and specify a working directory

    setwd("/Users/sframe_local/Downloads");

Once you are in a directory, you can browse using the Files tab or use the function dir from the command line.

dir();

R Packages

What makes R great, and in some ways superior to other analysis software, is the large number of user-contributed packages.

  • Packages contain functions and data (specifically data frames).
  • There are standard packages that come with R, but most have to be installed.

There are two distinct steps to using an R package.

  1. Install the package. Just because the package is installed, this does not mean you have access to the contents.

  2. Load the package. You cannot load a package that isn’t installed.

The Packages tab shows the packages that are installed and which are loaded (checked). The function library lists the installed packages.

library();

Click on Install to install a package, input the name of the package and then hit Install. You can also run this from the command line.

install.packages("ggplot2");

Use Update to update packages.

From the Packages tab, click on a package name to get information about the package. You can also run the below code.

library(help=ggplot2);

Load a package that is installed by calling it through the command line.

library(ggplot2);

It is helpful to place library function calls at the top of your scripts.

Data Types and Objects

There are four data types in R.

  • Logical: TRUE (T or 1) and FALSE (F or 0)
  • Numeric: numbers
  • Character: letters and strings, but can be numbers
  • Factors: categorical variables originally numeric or character

There are five object types in R.

  • Vector: one-dimensional collection of elements of the same data type. Access the elements of a vector with [index].
  • Matrix: two-dimensional collection of vectors of the same data type. Access the elements of a matrix with [row index, col index].
  • Data Frame: collection of vectors of different data types. This is what most R functionality is designed to work with. Access the elements of a data frame with [row index, col index].
  • List: collection of elements of different data types. Most function output is a list. Access the elements of a list with either [index] (which returns a list) or [[index]] (which returns the actual object)
  • Function: takes in arguments (possibly none), performs tasks, returns output. Functions are called with ().

With R, indexing starts at 1 (rather than 0).
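A minimal sketch of the object types and 1-based indexing described above (the object names here are illustrative):

```r
v = c(10, 20, 30);                # vector: elements of the same data type
v[1];                             # first element is at index 1, not 0

m = matrix(1:6, nrow=2);          # matrix: two-dimensional, filled by column
m[2, 3];                          # row 2, column 3

df = data.frame(x=v, y=c("a", "b", "c"));  # data frame: mixed data types
df[1, 2];                         # [row index, col index]

lst = list(nums=v, letter="z");   # list: elements of different types
lst[1];                           # [index] returns a one-element list
lst[[1]];                         # [[index]] returns the actual object
```
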

Uber Trips

Open and save a new R script (if you don’t already have one open). Below are costs for some of the Uber trips I have taken: 1, 5.65, 7.54, 10.99, 11.02, 13.44, 27.23.

  1. Use the function c to concatenate the values into a vector.

    uber = c(1, 5.65, 7.54, 10.99, 11.02, 13.44, 27.23);
  2. Use the functions is.vector and is.numeric to verify the vector is a vector and contains numeric data. There are similar functions for the other data and object types (eg is.character, is.matrix, is.data.frame).

    is.numeric(uber);
    is.vector(uber);
  3. R indexing starts at 1 (rather than 0). Access elements of a vector by referring to their index locations. This accesses the third element.

    uber[3];

    You can also access multiple elements in any order by creating a vector of indices.

    uber[c(5, 1, 3)];
  4. The function seq is used to create a sequence of values. Look at the help documentation for seq and pay close attention to the arguments from, to and by.

    ?seq

    Use the function to create a sequence of odd integers from 1 to 5. Then, use that sequence to specify the indices you want to access.

    uber[seq(1, 5, 2)];
  5. The function rep is used to repeat entire vectors or elements of vectors. Look at the help documentation for rep and pay close attention to the arguments x, times and each.

    ?rep

    The first two and last two Uber trips were at night; the others were during the day. Use the function to create a variable called tod.

    tod=c(rep(c("night"), times=2), rep(c("day"), times=3), rep(c("night"), times=2));

    Is this a character vector?

  6. Use the function cbind to column bind the vectors uber and tod into a matrix named uberInfo.

    uberInfo = cbind(uber, tod);

    Is this a matrix or a data frame? Is it character or numeric?

  7. Matrices are good for matrix algebra and that’s about all. Data frames (and perhaps tibbles later) are more common data containers. Use the function data.frame to create a data frame of the vectors. Use the function str to display the structure of the data.

    uberInfo = data.frame(uber, tod);
    str(uberInfo);
  8. Suppose you want to change the names of the variables in the data frame. There are two ways to do so. The first way is to specify the names in the function data.frame itself.

    uberInfo = data.frame(cost=uber, time=tod);
    names(uberInfo);

    The second is to use the function names.

    names(uberInfo) = c("Cost", "Time");
    names(uberInfo);
  9. You can access a variable using the operator $.

    uberInfo$Cost;

    Now that it is a vector, you can access its elements.

    uberInfo$Cost[c(1,3,5)];
  10. You can also access columns and rows by numbers.

    uberInfo[,1];
    uberInfo[c(1,3,5),];
    uberInfo[c(1,3,5), 1];

Great White Sharks

Open and save a new R script. Below are the lengths (in feet) for several great white sharks: 18.7, 12.3, 18.6, 16.4, 15.7, 18.3, 14.6.

Complete each set of code by setting each Change_Me.

  1. Create a vector of the shark lengths. Verify the object is a vector. Verify the object contains numeric data.

    sharkLengths = c(Change_Me);
    is.vector(Change_Me);
    is.numeric(Change_Me);
  2. Apply the following functions to the vector: sqrt, sum, length, mean, median, summary, var and sd.

    sqrt(Change_Me);
  3. The standard deviation is given by \(s = \sqrt{\frac{\displaystyle \sum_{i=1}^{n} \left(x_{i}-\overline{x}\right)^{2}}{n-1}}\). Use the functions sqrt, sum, mean and length to compute the standard deviation in a single line of code. Verify your result using the function sd.

  4. You can convert feet to meters by the following: meters = feet * 0.3048. Convert the lengths to meters (but do not store them). R is vectorized, so just multiply the vector by the value.

    Change_Me*0.3048;
  5. Inspect the help documentation for the function quantile. Use the function to obtain the 25th percentile (\(Q_{1}\)) and the 75th (\(Q_{3}\)) percentile of the lengths (use the default type), and store the output in a new object. Then use the object to compute the IQR = \(Q_{3} - Q_{1}\).

    sharkQuants = quantile(Change_Me, probs=c(0.25, Change_Me));
    sharkQuants[Change_Me] - sharkQuants[Change_Me];
  6. Use the function quantile to obtain the median.

  7. Use the function seq to create a sequence of odd numbers from 1 to 7. Use this sequence to access the odd elements of the vector containing the lengths.

    seq(from=Change_Me, to=Change_Me, by=Change_Me);
    sharkLengths[seq(from=Change_Me, to=Change_Me, by=Change_Me)];
  8. The first five shark lengths were taken from sharks found in California and the last two are from Mexico. Use the function rep to create a (character) vector for the location. Verify the object is a vector. Verify the object contains character data.

    sharkLocation = c(rep("CA", times=Change_Me), rep("MX", times=Change_Me));
    is.vector(Change_Me);
    is.Change_Me(sharkLocation);
  9. Create a data frame that contains the lengths and the location information.

    sharkData = data.frame(length=Change_Me, location=Change_Me);
  10. Regardless of your naming convention, change the names of the variables to Length and Location. Access the length column in two different ways.

    names(Change_Me) = c("Change_Me", "Change_Me");
    sharkData$Change_Me;
    sharkData[,"Change_Me"];
  11. Apply the following functions to the data frame: dim, nrow, ncol, names, row.names.

    dim(Change_Me);

Homework #1

Produce command code for each of the homework problems. Create an R script for your homework that contains your code. The homework quiz will ask you about the results you generate, and some questions will ask you to copy/paste your code. The code chunks that have a Change_Me need to be completed by changing Change_Me to the correct code.

  1. Use the functions seq and/or rep to create the following sequences.

    1. 1 2 3 4 1 2 3 4 1 2 3 4

      rep(seq(from=Change_Me, to=Change_Me, by=Change_Me), times=Change_Me);
    2. 10.00000 10.04545 10.09091 10.13636 10.18182 10.22727 10.27273 10.31818 10.36364 10.40909 10.45455 10.50000

      seq(from=Change_Me, to=Change_Me, length.out=Change_Me);
    3. “1” “1” “2” “2” “3” “3” “bonkers” “bonkers”

      rep(c(Change_Me:Change_Me, "bonkers"), each=Change_Me);
  2. Access the data frame stackloss by just typing the name in the command line (this data frame is part of the package datasets, which loads automatically).

    1. Create a new data frame called tempacid from the acid.conc and water.temp variables. Name these variables acid and temperature.

    2. Use command code in two different ways to access the acid variable in tempacid.

    3. Use command code in two different ways to access the tenth observation of the temperature variable in tempacid.

  3. The following are a sample of observations on incoming solar radiation in a greenhouse: 11.1, 10.6, 3.6, 8.8, 7.10, 11.2, 9.8, 12.2.

    1. Create a vector for these data called radValues.

    2. Add 10 to each of the values but do not store the result.

    3. Apply the functions mean, summary, length and sqrt to the vector.

    4. Two new observations become available: 7.2 and 12.4. Update the radValues vector with the new observations by concatenating your old radValues vector and these new values.

    5. The standard deviation is given by \(s = \sqrt{\frac{\displaystyle \sum_{i=1}^{n} \left(x_{i}-\overline{x}\right)^{2}}{n-1}}\). Use the functions sqrt, sum, mean and length to compute the standard deviation in a single line of code. Verify your result using the function sd.

  4. Cal Poly has sent you a message it wants you to decode: 23068, 26512, 27424, 19232, 26512, 27868, 57324, 18552, 12768, 24360, 16752, 28396.

    1. Create a vector called message that contains the given integers. Reverse the order of the elements so the first becomes the last and so on. There is a function you can use for this, ?rev.

    2. Convert the message into a matrix with two columns filled in by column.

    3. Set the second column equal to 40,000 minus the second column. Then set the first column equal to (the first column minus the second column), all divided by two.

    4. Convert the matrix back to a vector. Set the message elements equal to the square root of (the message elements divided by four, minus eight).

    5. The decoding key consists of all lower-case letters (letters is an object available to you), upper-case letters (LETTERS), the digits 0-9, and the below punctuation.

      punc = c(".", ",", "!", "?", "'", '"', "(", ")", " ", "-", ";", ":", ' ');

      Create a vector called key that contains the letters, LETTERS, digits, and punc (in this order).

    6. Your transformed message is a vector of indices that correspond to the decoding key. Use these indices to extract (subset) the elements of the decoding key. Display the message as a single character string (hint: consider the function paste with the appropriate argument collapse specification).

    7. The previous parts walk you through the process step by step. Assemble all of your code together here, and set your code up so that it runs from start to finish and displays the decoded message.

Week 2

Some Functions for Reading Data

Below are some functions to read data.

  • read.table: .txt and .dat files
  • read.csv: .csv files
  • read.fwf: fixed-width files (usually .txt and .dat)
  • read.xlsx: in the package xlsx is used for xlsx files
  • read.xlsx: in the package openxlsx is used for xlsx files
  • read_xlsx: in the package readxl is used for xlsx files
  • The packages RMySQL and RPostgreSQL are useful for making connections to, querying and reading SQL databases
  • Web scraping: RCurl, httr, rvest, XML

There are comparable functions for writing data to files as well (eg write.csv).
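As a small sketch of the writing side (the data and file name here are illustrative), write.csv writes a data frame out and read.csv can read it back:

```r
# a small illustrative data frame to write out
priceDf = data.frame(item=c("A", "B"), cost=c(1.50, 2.25));

# row.names=FALSE avoids writing an extra index column
write.csv(priceDf, file="prices.csv", row.names=FALSE);

# read it back in to verify the round trip
check = read.csv(file="prices.csv", header=TRUE);
```
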

To read a file, you will need to do one of the below.

  • Change your working directory to the location of the file
  • Include in the file name the complete path

Factors

Factors are used to represent categorical variables.

  • Define groups or categories, take on a small number of values
  • The values could be either numeric or character to start
  • Factors map integers to character strings

A factor has levels: the unique values of the variable.

  • The order of the levels can be important
  • By default, the factor levels are the unique values in alphabetical or numerical order

A summary of the arguments to the function factor is below.

  • levels: specify the levels, in particular the ordering
  • labels: specify how the values are displayed
  • ordered: if TRUE, indicates the ordering is important

Use the function levels to determine the levels.
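A minimal sketch of these arguments (the data here are illustrative):

```r
# sizes recorded as character strings
sizes = c("small", "large", "medium", "small");

# default: levels are the unique values in alphabetical order
f1 = factor(sizes);
levels(f1);   # "large" "medium" "small"

# levels specifies the ordering; ordered=TRUE says the ordering matters
f2 = factor(sizes, levels=c("small", "medium", "large"), ordered=TRUE);
levels(f2);   # "small" "medium" "large"

# labels controls how the values are displayed
f3 = factor(sizes, levels=c("small", "medium", "large"),
            labels=c("S", "M", "L"));
```
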

Legos

Download the files Legos.csv and Legos.xlsx. Put them in a directory you want to work in and determine the path.

  1. Change your directory and read the CSV file using the function read.csv.

    # change the below to your directory
    setwd("/Users/sframe/Downloads");
    
    # now read the file
    data = read.csv(file="Legos.csv", header=TRUE);

    You could also include the path in the filename.

    # change the below to your directory
    data = read.csv(file="/Users/sframe/Downloads/Legos.csv", header=TRUE);

    Use the function str to display the structure of the data.

    str(data);

    What data type is the Product variable?

  2. Use the functions head and tail to see the first/last rows.

    head(data);
    tail(data);

    Notice that the age for the second observation appears to be missing.

  3. Use the function is.na to look for missing values in the first few cases.

    is.na(head(data));

    Even though it is missing, R is not interpreting it as a missing value (NA).

  4. Now read in the data but use the argument na.strings and specify strings that represent missing values in the file.

    data = read.csv(file="Legos.csv", header=TRUE, na.strings=c(""));

    Now check the data.

    head(data);
    is.na(head(data));
  5. The function na.omit will remove rows with at least one missing value.

    data = na.omit(data);
    head(data);

    Notice that the entire row is gone.

  6. Other useful functions about the dimensions are below.

    dim(data);
    nrow(data);
    ncol(data);
  7. Use the function factor to convert a variable to a factor and change the order of the levels.

    data$Gender = factor(data$Gender, levels=c("Male", "Female"));
  8. Use the function names to get and change the names of the variables.

    names(data);

    Suppose we want to change Gender to Sex, which is the fifth name.

    names(data)[5] = "Sex";
    names(data);
  9. There are several ways to access vectors.

    data$Price;

    And then you can access elements of the vector.

    # access first 
    data$Price[1];
    
    # access 3, 5, 7
    data$Price[c(3,5,7)];
    
    # access 9, 11, 13
    data$Price[seq(9,13,2)];
  10. You can access several rows at a time. Not referencing the column dimension gives all columns.

    data[seq(1,5,2), ];
  11. You can access several columns at a time. Not referencing the row dimension gives all rows.

    data[,c("Pieces", "Price")];
    data[,c(3,4)];
  12. Open the help documentation for the function read.xlsx in the library openxlsx. Use this function to read the XLSX file. Examine the structure of the data and display the head of the data.

    library(openxlsx);
    data2 = read.xlsx(xlsxFile="Legos.xlsx", sheet=1, colNames=TRUE);
    detach("package:openxlsx", unload=TRUE);

Merging

The function merge is useful for merging data frames. Below is a summary of the arguments to the function.

  • x and y: the data frames to merge
  • by: the variable name to merge by
  • by.x and by.y: the variable names in each object to merge by
  • all, all.x, all.y: logical values that specify the type of merge

The type of join performed is determined by the all arguments.

Let’s set up some different data frames for customer ID, product and state information.

df1 = data.frame(CustomerId = c(1:7),
                 Product = c(rep("Oven", 3), rep("Television", 4)));

df2 = data.frame(CustomerId = c(2, 4, 6), 
                 State = c(rep("California", 2), rep("Texas", 1)));

Notice that df2 contains a subset of the customers.

  • Natural join: return only the rows in which the left table has matching keys in the right table

    md1 = merge(x=df1, y=df2, by="CustomerId", all=FALSE);
    md1;
  • Outer join: return all rows from both tables, joining records from the left that have matching keys in the right table

    md2 = merge(x=df1, y=df2, by="CustomerId", all=TRUE);
    md2;

    Use the functions is.na and na.omit to detect instances of missing values and to omit rows with at least one missing value.

    is.na(md2);
    na.omit(md2);
  • Left outer join: return all rows from the left table, and any rows with matching keys from the right table

    md3 = merge(x=df1, y=df2, by="CustomerId", all.x=TRUE);
    md3;
  • Right outer join: return all rows from the right table, and any rows with matching keys from the left table

    md4 = merge(x=df1, y=df2, by="CustomerId", all.y=TRUE);
    md4;
  • Cross join: a cross join (also sometimes known as a Cartesian Join) results in every row of one table being joined to every row of another table

    md5 = merge(x = df1, y = df2, by = NULL);
    md5;
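The by.x and by.y arguments handle the case where the key column has a different name in each data frame; a minimal sketch (these data frames are illustrative, separate from the example above):

```r
# the same customers, but the key column is named differently in each frame
ordDf = data.frame(CustId = c(1, 2, 3), Item  = c("Oven", "TV", "Radio"));
stDf  = data.frame(ID     = c(1, 2, 3), State = c("CA", "CA", "TX"));

# match CustId in x to ID in y; the merged key keeps the name from by.x
m = merge(x=ordDf, y=stDf, by.x="CustId", by.y="ID");
m;
```
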

Subsetting

When we access the elements of a vector or data frame, we are secretly subsetting our data. There are two ways of subsetting.

uber = c(1, 5.65, 7.54, 10.99, 11.02, 13.44, 27.23);
tod=c(rep(c("night"), times=2), rep(c("day"), times=3), rep(c("night"), times=2));
uberInfo = data.frame(cost=uber, time=tod);
  • Use logic to identify cases to either keep or drop. The function which is useful for identifying cases that meet a certain condition.

    whichNight = which(uberInfo$time=="night");
    uberInfo[whichNight,];

    You can also do this directly.

    uberInfo[(uberInfo$time=="night"),];
  • Use the function subset. Consult the help documentation for the function.

    subset(uberInfo, subset=(time=="night"));

Sorting

The function sort will sort values and return a sorted version of the input vector.

prices = c(44, 55, 54, 52, 19, 20, 16);
year = c(rep(2014, times=3), rep(2015, times=4));
wineDf = data.frame(prices, year);

wineDf$prices;
sort(wineDf$prices);

The function rank ranks the values in the input vector. In the output, the index location of each rank corresponds to the index location of the input vector.

wineDf$prices;
rank(wineDf$prices);

The function order determines the index order of the sorted input vector.

  • The output vector contains the index values specifying the order of the input vector so the input vector will be sorted (ie how to permute them).

  • The index location of each value does not correspond to the index location of the input vector.

rbind(1:7, wineDf$prices);
order(wineDf$prices);

Order the rows of the data frame (or another vector) based on the order of the prices.

wineDf[order(wineDf$prices),];

The Workspace

An object is any vector, matrix, data frame, list or function.

  • When R is initially started, it creates an empty workspace
  • An R workspace or environment is the collection of objects in your session; it may be empty

The contents of the workspace can be seen in the Environment pane. The function ls returns a vector that contains the names of the objects in the workspace.

ls();

You can save a workspace with any collection of objects to a .Rdata file that can be easily loaded again.

  • This is a good idea if it was expensive or difficult to create a data frame you want to use again
  • Use the function save, and note that this file will be saved in your current directory unless you specify a path name
save(wineDf, file="WineData.Rdata");

Verify the file has been saved in the current directory.

dir();

The function save.image will save everything.

save.image("Everything.Rdata");

When you close R and it asks if you want to “Save workspace…”, this is essentially what R is doing.

You can remove one or more objects from the workspace.

rm(df1, df2);
rm(list=ls());

Use the function load to load a workspace. Note you do not use assignment here!

load("Everything.Rdata");

Global Super Stores

The file Global Superstore.xlsx contains order-level information on sales from a store. There are sheets for orders and returns. Download the data, then open and inspect the file. Do not include directory changes or paths in your file names in your submitted code.

  1. Use the function read.xlsx in the library openxlsx to read the orders sheet (specify an integer). How many observations and variables are there? What are the names of the variables? Hint: use dim and names.

    library(Change_Me);
    orders = read.xlsx("Global Superstore.xlsx", sheet=Change_Me, check.names=TRUE, detectDates=TRUE);
    Change_Me(orders);
    Change_Me(orders);
  2. Read the returns sheet (specify an integer). How many observations and variables are there? What are the names of the variables?

    returns = read.xlsx(Change_Me, sheet=Change_Me, check.names=Change_Me, detectDates=Change_Me);
  3. Do a natural join of the orders (x) and returns (y) data by the order ID. How many observations and variables are there? What are the names of the variables? Have any of the variable names changed?

    md = merge(orders, returns, by=Change_Me, all=Change_Me);
  4. Subset the merged data for orders/returns from the USCA market using the Market.y variable. How many observations are there? Look at the head of the subset. Hint: use nrow and head.

    mdb = md[md$Change_Me=="USCA",];
    Change_Me(mdb);
    Change_Me(mdb);
  5. Convert the category variable into a factor. What are the factor levels?

    mdb$Category = Change_Me(mdb$Category);
    Change_Me(mdb$Category);
  6. Subset the subset one last time for orders/returns from the furniture category. How many observations are there? Look at the head of the subset.

    mdc = mdb[mdb$Change_Me=="Change_Me",];
  7. Sort the entire subset by the profit from largest to smallest. Look at the head of the subset.

    mdc = mdc[Change_Me(mdc$Profit, decreasing=Change_Me),];

Homework #2

For these problems, the directory that I execute your code in will contain the files you need to read. Do not include directory changes or paths in your file names in your submitted code, and do not alter the files I have distributed.

  1. The file Basketball.txt contains some statistics on professional basketball players. These fields are a fixed-width format so use the function read.fwf to read the file (hint: check out the argument widths). Use the function str to display the structure.

  2. The file SharkAttacks.csv contains the number of shark attacks for various decades. The file SharkFatalities.csv contains the percentage of shark attacks which are fatal for various decades.

    1. Read the file for the attack records. Specify the appropriate argument to identify the missing value(s).

    2. Read the file for the fatality records. Specify the appropriate argument to identify the missing value(s).

    3. Merge these two files together with a full outer join.

    4. Remove any rows with missing values.

  3. The file classData.Rdata contains a number of data frames for various classes that I teach.

    1. Load the workspace.

    2. How many data frames does the workspace have?

    3. How many rows and columns are in each of the following data frames: amtrak, bankLoss, gasOil, kickstarter?

  4. Quantitative finance involves a lot of data management and statistics. This exercise walks you through the process of obtaining and managing stock/Libor data, and using that information to create a CAPM model based on daily log returns adjusted by a risk-free interest rate. Do not use any iteration on this problem, everything can be done with vectorized operations.

    1. This is a link to the adjusted closing levels of the S&P500 (SPX). Copy and paste the link. Use the link to read the CSV file directly from Yahoo! Finance (ie don’t download and read the file locally).

    2. This is a link to the adjusted closing prices of Lockheed Martin (LMT). Copy and paste the link. Use the link to read the CSV file directly from Yahoo! Finance (ie don’t download and read the file locally).

    3. Merge these two data frames together by the date with a natural join.

    4. For both SPX and LMT, create log returns given by \(R_{t} = log\left(\frac{P_{t}}{P_{t-1}}\right)*100\) where \(P_{t}\) is the adjusted closing price at time \(t\). Create a new data frame that contains the dates and the log returns.

    5. The file USDONTD156N.xlsx contains Libor rates (the risk-free interest rate we will use). Read this file.

    6. Merge the data frame with the log returns and the Libor rates by date with a natural join. Remove any missing values.

    7. Subtract the Libor rate from the log returns of SPX and then also LMT (this creates log-adjusted returns, adjusted by the risk-free rate of Libor).

    8. Find the correlation between the SPX and LMT log-adjusted returns. Fit a regression model that uses the SPX log adjusted returns to predict the LMT log-adjusted returns. Hint: use the function lm. Look at a summary of the model object. The intercept and slope of the estimated model are kinda close to the alpha and beta of LMT (based on daily data and not monthly).

Week 3

Summaries for Categorical Data

The primary summaries for categorical data are counts (frequencies) and proportions (relative frequencies).

  • table: counts for univariate or bivariate categorical data
  • prop.table: converts counts into proportions; for bivariate data it provides proportions in three ways according to the margin (two of which give conditional proportions)

Example: consider the data frame customerChurn in the workspace classData.Rdata.

  • Obtain counts and proportions for the type of contract.

    table(customerChurn$Contract);
    prop.table(table(customerChurn$Contract));    
  • Obtain counts for the type of contract cross-classified by whether the customer has a partner. This is called a contingency table or a crosstab.

    table(customerChurn$Partner, customerChurn$Contract);

    Applying the function prop.table without a margin makes all the proportions in the table sum to 1.

    prop.table(table(customerChurn$Partner, customerChurn$Contract));

    If you set margin=1, the rows sum to 1 (ie conditional on having a partner).

    prop.table(table(customerChurn$Partner, customerChurn$Contract), margin=1);

    If you set margin=2, the columns sum to 1 (ie conditional on contract type).

    prop.table(table(customerChurn$Partner, customerChurn$Contract), margin=2);

Summaries for Numerical Data

The standard summaries for numerical data include the mean/average, median, standard deviation and sample size. Typically, you will need numerical summaries of a numeric variable by the levels of a categorical variable or factor.

  • tapply: apply a function (usually a numerical summary) to a numeric variable by the levels of a factor
  • apply: apply a function to either the rows or columns of a data frame, depending on the margin

Example: consider the data frame customerChurn in the workspace classData.Rdata.

  • Obtain summary statistics of the monthly charges for customers with and without a partner.

    tapply(customerChurn$MonthlyCharges, customerChurn$Partner, mean);
    tapply(customerChurn$MonthlyCharges, customerChurn$Partner, median);
    tapply(customerChurn$MonthlyCharges, customerChurn$Partner, sd);
    tapply(customerChurn$MonthlyCharges, customerChurn$Partner, length);
  • Obtain summary statistics of the monthly charges cross-classified by having a partner and contract type.

    tapply(customerChurn$MonthlyCharges, list(customerChurn$Partner, customerChurn$Contract), mean);
    tapply(customerChurn$MonthlyCharges, list(customerChurn$Partner, customerChurn$Contract), median);
    tapply(customerChurn$MonthlyCharges, list(customerChurn$Partner, customerChurn$Contract), sd);
    tapply(customerChurn$MonthlyCharges, list(customerChurn$Partner, customerChurn$Contract), length);
  • Obtain summary statistics of the tenure (length of time being a customer) and the monthly charges.

    apply(customerChurn[,c("tenure", "MonthlyCharges")], 2, mean);

    You could also apply this to the rows, although it would be meaningless in this situation.

    apply(customerChurn[,c("tenure", "MonthlyCharges")], 1, mean);
  • Find the correlation between two numeric variables.

    cor(customerChurn$tenure, customerChurn$MonthlyCharges);

    Or if you want a correlation matrix, give R the entire data frame.

    cor(customerChurn[,c("tenure", "MonthlyCharges")]);

Saratoga Home Prices I

The data frame saratoga in the workspace classData.Rdata contains the sales price, number of bedrooms, and if homes have a fireplace or not.

  1. Obtain counts and proportions for the fireplace variable.

    table(saratoga$Change_Me);
    Change_Me.table(Change_Me(saratoga$Change_Me));  
  2. Create a contingency table for the fireplace and number of bedrooms variables.

    Change_Me(saratoga$Change_Me, saratoga$Bedrooms);
  3. Use your table to find proportions that are conditional on the fireplace variable.

    prop.Change_Me(table(saratoga$Fireplace, saratoga$Change_Me), margin=Change_Me);
  4. Find the mean, median, standard deviation and sample size of the prices with and without a fireplace.

    tapply(saratoga$Change_Me, saratoga$Fireplace, mean);
    Change_Me(saratoga$Price, saratoga$Fireplace, median);
    tapply(saratoga$Price, saratoga$Change_Me, sd);
    tapply(saratoga$Price, saratoga$Fireplace, Change_Me);
  5. Find the mean, median, standard deviation and sample size of the prices cross-classified by the fireplace and number of bedrooms.

    tapply(saratoga$Change_Me, list(saratoga$Fireplace, saratoga$Bedrooms), mean);
    Change_Me(saratoga$Price, list(saratoga$Change_Me, saratoga$Bedrooms), median);
    tapply(saratoga$Price, list(saratoga$Fireplace, saratoga$Change_Me), sd);
    tapply(saratoga$Price, Change_Me(saratoga$Change_Me, saratoga$Change_Me), Change_Me);
  6. Use the function apply to find the total number of bedrooms and bathrooms for each home.

    sumOut = apply(saratoga[,c("Baths", "Bedrooms")], Change_Me, sum);
  7. Find the mean and standard deviation of the total number of rooms.

    mean(Change_Me);
    Change_Me(sumOut);

Concert Concession Sales

The workspace classData.Rdata contains the data frame concert. The data are the concession sales for a local concert venue. For each concert, there is data on the day of week (DOW), month, concession stand profit (Prof), concert attendance (Att), and a variable indicating if the concert took place on a weekend (We).

  1. Find counts (and proportions) of concerts for the month.

  2. Find counts (and proportions) of concerts for weekdays vs. weekends.

  3. Create a contingency table of the concerts by month and weekday vs. weekend.

  4. Find the mean, median, standard deviation and sample size of the profits for each month.

  5. Find the mean, median, standard deviation and sample size of the profits for weekdays vs. weekends.

  6. Find the mean, median, standard deviation and sample size of the profits cross-classified by month and weekdays/weekends.

  7. Find the correlation between the profit and attendance.

Homework #3

The workspace classData.Rdata contains the data frame bankLoss which has information on bank losses including the region (Region), risk category (Risk.Category), net loss (Net.Loss) and recovery amount of each loss event (Recovery.Amount..percent.). Produce command code to answer each part. Do each part with both base and tidyverse methods. For a bit more on summaries including counts and proportions with tidy, check out this page. Do not alter the original data frame or create a duplicate copy.

  1. Obtain counts and proportions for the risk categories.

  2. Create a contingency table for the region and risk category. Find sample proportions for the risk categories conditional on region.

  3. Obtain summary statistics for the following variables: net losses and recovery amounts. Use the function sum to find total net losses.

  4. Obtain summary statistics for the net losses by region. Use the function sum to find the total net losses by region. Then determine the percentage of the total net losses by region.

  5. Obtain summary statistics for the net losses by region and risk category. Find the total net losses by region and risk category. Then determine the percentage of the total net losses by region and risk category.

  6. Find the correlation between the net losses and recovery amounts.

  7. Find the correlation between the net losses and recovery amounts by region and risk category. Only do this using tidy functionality.

Week 4

Summary of Base Plotting Functionality

With either base or ggplot2, below are guiding principles for making visualizations.

  • Build things in layers, adding one thing at a time
  • Don’t waste your time seriously annotating a graph until you are sure you want to use it

Below is a summary of the functionality for the base plotting functions. These arguments are specific to the function plot, but many will work with other base plotting functions.

  • \(Y \sim X\): formula input
  • xlab, ylab: labels the x-axis and y-axis
  • xlim, ylim: defines the span of X and Y values
  • main: specifies the title
  • type: p for points, l for lines, b for both, n for none
  • lty: 1=solid, 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash
  • pch: 19=solid circle, 20=bullet (smaller circle), 21=circle, 22=square, 23=diamond
  • col: integers or names see colors() or an rgb specification
  • axes: if FALSE turns off the axes

Below are some useful plotting functions.

  • windows (Windows) / quartz (Mac): creates a new graphics window/device

  • par: sets graphical parameters

    • par(mfrow=c(R,C)): sets the graphics device to allow for multiple graphs on the same graph sheet organized by row into a grid of R by C graphs

    • par(mfcol=c(R,C)): similarly but organized by column

  • legend: make a legend

    • x,y: specifies the location of the top left corner of the box

    • locator(1): to manually place the legend

    • legend: specifies text in the legend

    • pch, col, lty: specify characteristics; use NA when certain controls are not used

  • abline: places a line on a graph

    • h for a horizontal line, v for a vertical line

    • a and b specify an intercept and slope of a line

    • reg specifies linear model objects

    • coef specifies the coefficients of a linear model object

  • axis: build your own axes

    • side specifies which axis

    • at specifies where the ticks are

    • labels specifies what the ticks are

  • arrows: draws lines from an origin to a specified point

    • x0, y0 specify the origin

    • x1, y1 specify the destination

    • code specifies the type

    • angle specifies the angle of the arrow

  • mtext: place text on a graph

    • text is the text to be placed

    • side controls the location (1=bottom, 2=left, 3=top, 4=right)

  • text: allows you to place text in the graph to label points

    • x,y is the location to place text, usually corresponds to x,y pairs of interest

    • labels is the text given to said pairs

    • pos controls the location (1=bottom, 2=left, 3=top, 4=right)
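
The annotation functions above can be combined in a short sketch. The built-in cars data set (speed vs stopping distance) stands in for class data here, and all labels and tick choices are illustrative.

```r
# A sketch combining several of the functions above, using the built-in
# cars data set (speed vs stopping distance); labels are illustrative.
fit = lm(dist~speed, data=cars);

plot(dist~speed, data=cars, pch=19, col="blue",
     xlab="Speed (mph)", ylab="Stopping distance (ft)",
     main="Annotating a base plot", axes=FALSE);
axis(side=1, at=seq(0, 25, 5));                      # custom x ticks
axis(side=2, at=seq(0, 120, 30));                    # custom y ticks
abline(reg=fit, lty=2, col="orange");                # fitted regression line
legend("topleft", legend=c("Observed", "Fitted"),
       pch=c(19, NA), lty=c(NA, 2), col=c("blue", "orange"));
mtext("Source: built-in cars data", side=1, line=4, cex=0.8);
```

Note the NA entries in the legend call: the observed series gets a point character but no line type, and the fitted line gets the reverse.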

Plots

The data frame customerChurn in the workspace classData.Rdata contains the tenure (length of time being a customer) and total charges for customers at a telecom company. Here we consider the relationship between the tenure and total charges.

  • Start with a very basic plot

    plot(customerChurn$tenure, customerChurn$TotalCharges);

    We can also specify a formula, Y~X.

    plot(TotalCharges~tenure, data=customerChurn);
  • Add a few basic annotations.

    plot(TotalCharges~tenure, data=customerChurn, xlab="Length of time as customer", ylab="Total charges", 
         main = "Look at that funnel shape", 
         pch = 20, col="blue");

    The function rgb is great for dense data.

    plot(TotalCharges~tenure, data=customerChurn, xlab="Length of time as customer", ylab="Total charges", 
         main = "Look at that funnel shape", 
         pch = 20, col=rgb(0, 0, 1, 0.1));
  • Change the axis limits.

    plot(TotalCharges~tenure, data=customerChurn, xlab="Length of time as customer", ylab="Total charges", 
         main = "Look at that funnel shape", 
         pch = 20, col=rgb(0, 0, 1, 0.1), 
         xlim=c(0, 80), ylim=c(0, 10000));

    Now change the ticks.

    plot(TotalCharges~tenure, data=customerChurn, xlab="Length of time as customer", ylab="Total charges", 
         main = "Look at that funnel shape", 
         pch = 20, col=rgb(0, 0, 1, 0.1), 
         xlim=c(0, 80), ylim=c(0, 10000), axes=FALSE);
    axis(side=1, at=seq(0, 80, 10));
    axis(side=2, at=seq(0, 10000, 2500), labels=c("$0", "$2500", "$5000", "$7500", "$10000"));
  • Now add some coloring for a categorical variable. First lets set up a variable for indexing.

    whereNo = which(customerChurn$Dependents=="No");
    whereYes = which(customerChurn$Dependents=="Yes");

    Now use the function points to add points.

    plot(TotalCharges~tenure, data=customerChurn, xlab="Length of time as customer", ylab="Total charges", 
         main = "Look at that funnel shape", 
         pch = 20, col="blue", 
         xlim=c(0, 80), ylim=c(0, 10000), axes=FALSE, type="n");
    axis(side=1, at=seq(0, 80, 10));
    axis(side=2, at=seq(0, 10000, 2500), labels=c("$0", "$2500", "$5000", "$7500", "$10000"));
    points(customerChurn$tenure[whereNo], customerChurn$TotalCharges[whereNo], pch=18, col=rgb(0,0,1,0.25));
    points(customerChurn$tenure[whereYes], customerChurn$TotalCharges[whereYes], pch=20, col=rgb(1,0.4,0,0.25));
  • Add a legend using the function legend. The components of the arguments legend, col, pch and lty are all index linked. The location of the legend can be specified several ways, locator(1) gives manual control.

    plot(TotalCharges~tenure, data=customerChurn, xlab="Length of time as customer", ylab="Total charges", 
         main = "Look at that funnel shape", 
         pch = 20, col="blue", 
         xlim=c(0, 80), ylim=c(0, 10000), axes=FALSE, type="n");
    axis(side=1, at=seq(0, 80, 10));
    axis(side=2, at=seq(0, 10000, 2500), labels=c("$0", "$2500", "$5000", "$7500", "$10000"));
    points(customerChurn$tenure[whereNo], customerChurn$TotalCharges[whereNo], pch=18, col=rgb(0,0,1,0.25));
    points(customerChurn$tenure[whereYes], customerChurn$TotalCharges[whereYes], pch=20, col=rgb(1,0.4,0,0.25));
    legend(locator(1), legend=c("No dependents", "Dependents"), col=c("blue", "orange"), pch=c(18, 20));

    Or try the below to specify the top left corner of the box.

    plot(TotalCharges~tenure, data=customerChurn, xlab="Length of time as customer", ylab="Total charges", 
         main = "Look at that funnel shape", 
         pch = 20, col="blue", 
         xlim=c(0, 80), ylim=c(0, 10000), axes=FALSE, type="n");
    axis(side=1, at=seq(0, 80, 10));
    axis(side=2, at=seq(0, 10000, 2500), labels=c("$0", "$2500", "$5000", "$7500", "$10000"));
    points(customerChurn$tenure[whereNo], customerChurn$TotalCharges[whereNo], pch=18, col=rgb(0,0,1,0.25));
    points(customerChurn$tenure[whereYes], customerChurn$TotalCharges[whereYes], pch=20, col=rgb(1,0.4,0,0.25));
    legend(10, 10000, legend=c("No dependents", "Dependents"), col=c("blue", "orange"), pch=c(18, 20));

    Or try the generic placement.

    plot(TotalCharges~tenure, data=customerChurn, xlab="Length of time as customer", ylab="Total charges", 
         main = "Look at that funnel shape", 
         pch = 20, col="blue", 
         xlim=c(0, 80), ylim=c(0, 10000), axes=FALSE, type="n");
    axis(side=1, at=seq(0, 80, 10));
    axis(side=2, at=seq(0, 10000, 2500), labels=c("$0", "$2500", "$5000", "$7500", "$10000"));
    points(customerChurn$tenure[whereNo], customerChurn$TotalCharges[whereNo], pch=18, col=rgb(0,0,1,0.25));
    points(customerChurn$tenure[whereYes], customerChurn$TotalCharges[whereYes], pch=20, col=rgb(1,0.4,0,0.25));
    legend("topleft", legend=c("No dependents", "Dependents"), col=c("blue", "orange"), pch=c(18, 20));
  • Suppose you have a graph you want to export for a deliverable. One option is to save it manually (click Export in the plot pane). Another is a graphics device function; of the many available, pdf saves graphs as a PDF. The benefit of PDFs is that you can save multiple graphs as separate pages in a single file.

    pdf(file="TenureVsCharges.pdf");
    plot(TotalCharges~tenure, data=customerChurn, xlab="Length of time as customer", ylab="Total charges", 
         main = "Look at that funnel shape", 
         pch = 20, col="blue", 
         xlim=c(0, 80), ylim=c(0, 10000), axes=FALSE, type="n");
    axis(side=1, at=seq(0, 80, 10));
    axis(side=2, at=seq(0, 10000, 2500), labels=c("$0", "$2500", "$5000", "$7500", "$10000"));
    points(customerChurn$tenure[whereNo], customerChurn$TotalCharges[whereNo], pch=18, col=rgb(0,0,1,0.25));
    points(customerChurn$tenure[whereYes], customerChurn$TotalCharges[whereYes], pch=20, col=rgb(1,0.4,0,0.25));
    legend("topleft", legend=c("No dependents", "Dependents"), col=c("blue", "orange"), pch=c(18, 20));
    dev.off();

Barcharts

Barplots or barcharts are useful for visualizing summarized categorical data. They can also convey summarized numerical data (e.g. summary statistics or counts of a discrete variable).

  • The function barplot does not summarize data; rather, it accepts summaries, which it uses as bar heights.

    barplot(table(customerChurn$Contract));    
  • Create a stacked barchart.

    barplot(prop.table(table(customerChurn$Partner, customerChurn$Contract), margin=2), 
            legend.text = c("No partner", "Partner"), col=c("orange", "blue"));
  • I tend to prefer a side-by-side barchart.

    barplot(prop.table(table(customerChurn$Partner, customerChurn$Contract), margin=2), 
            legend.text = c("No partner", "Partner"), col=c("orange", "blue"), 
            beside=TRUE, horiz=TRUE);
  • Mosaic plots are like stacked bar charts except the bar widths are proportional to the sample sizes.

    mosaicplot(table(customerChurn$Partner, customerChurn$Contract));    

Histograms

Histograms are used to visualize the distribution of a numeric variable.

par(mfrow=c(2,1));
hist(coasters$Age[coasters$Type=="Wooden"], xlab="", main="Wood Ages", 
     freq=FALSE, breaks = seq(0, 110, 10), xlim=c(0, 110), ylim=c(0, 0.06));
hist(coasters$Age[coasters$Type=="Steel"], xlab="", main="Steel Ages", 
     freq=FALSE, breaks = seq(0, 110, 10), xlim=c(0, 110), ylim=c(0, 0.06));
par(mfrow=c(1,1));

Boxplots

The standard boxplot is a way to visualize a five-number summary: the minimum, the \(25^{th}\) percentile, the median, the \(75^{th}\) percentile, and the maximum.

For the box-and-whisker plot, the outer bars are the last observations that are not outliers and outliers are noted as points beyond these whiskers (values 1.5 IQRs below Q1 and 1.5 IQRs above Q3).

If you assign the boxplot call, you get more detailed information about the outliers which is helpful if there are repeated values.

boxOut = boxplot(Age~Type, data=coasters, horizontal=TRUE);
boxOut;
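
The saved object is a list. A sketch of its useful components, shown with the built-in InsectSprays data so it runs outside the class workspace:

```r
# Inspect the list returned by boxplot; plot=FALSE computes the
# summaries without drawing anything (built-in InsectSprays data).
boxOut = boxplot(count~spray, data=InsectSprays, horizontal=TRUE, plot=FALSE);
boxOut$stats;   # five-number summary, one column per group
boxOut$out;     # the outlier values themselves
boxOut$group;   # which group each outlier belongs to
boxOut$n;       # sample size in each group
```

The group component is what resolves repeated outlier values: each entry of out is matched to its group by position.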

Saratoga Home Prices II

The data frame saratoga in the workspace classData.Rdata contains the sales price, number of bedrooms, and whether homes have a fireplace. For all of these problems, do not alter the original data frame or create new ones. Use the plotting functionality to set the aesthetics and annotations.

  1. Make the below visualization.

    barplot(prop.Change_Me(table(Change_Me$Bedrooms)), Change_Me=c(0,0.5), 
            main="Distribution of the number of bedrooms");

  2. Make the below visualization.

    barplot(prop.table(table(saratoga$Fireplace, saratoga$Bedrooms), margin=Change_Me), 
            Change_Me = c("No Fireplace", "Fireplace"), Change_Me=c("orange", "blue"), 
            Change_Me = "Number of bedrooms");

  3. Make the below visualization. Hint: use the functions density and lines.

    hist(saratoga$Price, main="Price distribution with density estimate", xlab="", ylab="", 
         axes=Change_Me, freq=Change_Me, ylim=c(0,0.00001));
    axis(side=Change_Me, Change_Me=seq(0,900000,300000), Change_Me=c("$0", "300k", "600k", "900k"));
    lines(Change_Me(saratoga$Price), col = "blue");

  4. Make the below visualization.

    par(Change_Me=c(2,1));
    hist(saratoga$Price[saratoga$Fireplace=="Yes"], main="Homes with Fireplaces", xlab="", ylab="",
         axes=Change_Me, freq=FALSE, xlim=c(0,900000), breaks=seq(0,900000,50000), ylim=c(0,0.00001));
    axis(side=Change_Me, at=seq(0,900000,300000), labels=c("$0", "300k", "600k", "900k"));
    hist(saratoga$Price[saratoga$Fireplace=="No"], main="Homes without Fireplaces", xlab="", ylab="",
         axes=FALSE, freq=Change_Me, xlim=c(0,900000), breaks=seq(0,900000,50000), ylim=c(0,0.00001));
    axis(side=1, Change_Me=seq(0,900000,300000), labels=c("$0", "300k", "600k", "900k"));
    par(mfrow=c(Change_Me,1));

  5. Make the below visualization.

    boxplot(Change_Me~Change_Me, data=saratoga, horizontal=Change_Me, axes=Change_Me, xlab="Price", ylab="");
    axis(side=Change_Me, Change_Me=seq(0,900000,300000), labels=c("$0", "300k", "600k", "900k"));
    axis(side=Change_Me, at=seq(1,7,1), Change_Me=c("1bd", "2bd", "3bd", "4bd", "5bd", "6bd", "7bd"));

  6. Make the below visualization.

    plot(Change_Me~Change_Me, data=Change_Me, pch=20, col=Change_Me(0,0,1,0.4), 
         xlab="Home size", ylab="Sales Price", axes=FALSE, xlim=c(0,6000));
    axis(side=Change_Me, at=seq(0,6000,2000));
    axis(side=2, Change_Me=seq(0,900000,300000), Change_Me=c("$0", "300k", "600k", "900k"));

  7. Make the below visualization.

    Change_Me(Change_Me~Change_Me, data=saratoga, pch=20, col=rgb(0,0,1,0.4), 
              xlab="Home size", ylab="Sales Price", axes=FALSE, xlim=c(0,6000), type="n");
    axis(side=1, at=seq(0,6000,2000));
    axis(side=2, at=seq(0,900000,300000), labels=c("$0", "300k", "600k", "900k"));
    points(saratoga$Living.Area[saratoga$Change_Me=="No"], 
           saratoga$Price[saratoga$Fireplace=="No"], 
           pch=20, col=rgb(1,0.5,0,0.4));
    Change_Me(saratoga$Living.Area[saratoga$Fireplace=="Yes"], 
              saratoga$Price[saratoga$Fireplace=="Yes"], 
              pch=20, col=rgb(0,0,1,0.4));
    Change_Me("topleft", Change_Me=c("Fireplace", "No Fireplace"), Change_Me=c("blue", "orange"), pch=c(20,20));

Homework #4

The workspace classData.Rdata contains the data frame concert. The data are the concession sales for a local concert venue. For each concert, there is data on the day of week (DOW), month, concession stand profit (Prof), concert attendance (Att), and a variable indicating if the concert took place on a weekend. For all of these problems, do not alter the original data frame or create new ones. Use the plotting functionality to set the aesthetics and annotations. Do not use the function axis in any part of this homework.

  1. Make the below visualization.

  2. Make the below visualization.

  3. Make the below visualization.

  4. Make the below visualization.

  5. Make the below visualization. Hint: use the functions lm and abline to make the line.

ggplot2 Nomenclature

Below are the important steps to building a visualization with ggplot2.

  1. Don’t forget to install and then load the library.

  2. Data: it is usually best to have all the variables in a single data frame. Specifying the name of the data frame in the ggplot function call is similar to specifying the data in a plot function call.

  3. Aesthetics: these specify the variables the geometric shape(s) will be applied to. This doesn’t have to be specified in the ggplot function call. You can specify the variables with an aes function call within a geom function call later.

  4. Geometric Shape: depends on the aesthetics specified, determines what the graph will look like.

  5. Grouping: color/fill/shape/linetype indicates groups within a given pane/panel/graph and are specified in the aes function call. Facets indicate groups with paneling.

  6. Other: labeling (axis, title), specifying colors/lines manually, customizing legends

  7. Save or export the graph, or include it automatically with R Markdown.

As with base plots, do not attempt to build a finished product all at once. Build in layers.
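
Step 7 can be done in code with the function ggsave, which infers the file type from the extension. A minimal sketch with the built-in iris data; the file name is an arbitrary choice.

```r
# Save a ggplot2 graph to a file with ggsave; the extension (.pdf here)
# determines the device, and width/height default to inches.
library(ggplot2);

p = ggplot(iris) + geom_point(aes(Sepal.Length, Sepal.Width, color=Species));
ggsave("irisPlot.pdf", plot=p, width=6, height=4);
```

If plot is omitted, ggsave saves the most recently displayed graph.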

Plots

Scatter plots are useful for investigating the relationship between two numeric variables, and how that relationship might depend on other variables, particularly categorical factors (i.e. interactions).

  • Start with the initial ggplot function call.

    library(ggplot2);
    
    ggplot(customerChurn);

    Now add aesthetics.

    ggplot(customerChurn, aes(tenure, TotalCharges));

    Now add a geom.

    ggplot(customerChurn, aes(tenure, TotalCharges)) + geom_point();
  • You don’t have to specify the aesthetics inside the initial ggplot function call. You can instead specify them in the geom calls, which tends to make later annotation a bit easier.

    ggplot(customerChurn) + geom_point(aes(tenure, TotalCharges));
  • Provide subgroup comparison within a single graph/panel through shapes and color.

    ggplot(customerChurn) + geom_point(aes(tenure, TotalCharges, color=Dependents));
    ggplot(customerChurn) + geom_point(aes(tenure, TotalCharges, shape=Dependents));

    You can add many variables here, but don’t add too many.

  • Provide subgroup comparison through paneling rather than within a single panel.

    ggplot(customerChurn) + geom_point(aes(tenure, TotalCharges, color=Dependents)) + 
      facet_wrap(~Contract);
  • You can move/rename the legend and add some labeling.

    ggplot(customerChurn) + geom_point(aes(tenure, TotalCharges, color=Dependents)) + 
      theme(legend.position="top", legend.justification="left", legend.key.size = unit(0.5, "cm")) + 
      labs(color="Has Dependents?", x="Length of time as customer", y="Total Charges",
           title="Tenure vs Total Charges by Dependent", 
           caption="made with ggplot2");
  • You can change the axis and the way it is displayed.

    ggplot(customerChurn) + geom_point(aes(tenure, TotalCharges, color=Dependents)) + 
      scale_x_continuous(breaks=seq(0,80,10)) +
      scale_y_continuous(breaks=seq(0,10000,1000), labels = paste("$", seq(0,10000,1000), sep="")) +
      theme(legend.position="top", legend.justification="left", legend.key.size = unit(0.5, "cm")) + 
      labs(color="Has Dependents?", x="Length of time as customer", y="Total Charges",
           title="Tenure vs Total Charges by Dependent", 
           caption="made with ggplot2");
  • You can define your own colors (and other aesthetics) and labels, and use transparency.

    ggplot(customerChurn) + geom_point(aes(tenure, TotalCharges, color=Dependents), alpha=0.2) + 
      scale_x_continuous(breaks=seq(0,80,10)) +
      scale_y_continuous(breaks=seq(0,10000,1000), labels = paste("$", seq(0,10000,1000), sep="")) +
      scale_color_manual(name="Dependents?", values=c("orange", "blue"), labels=c("No", "Yes")) +
      theme(legend.position="top", legend.justification="left", legend.key.size = unit(0.5, "cm")) + 
      labs(x="Length of time as customer", y="Total Charges",
           title="Tenure vs Total Charges by Dependent", 
           caption="made with ggplot2");
  • Add a loess smoother (nonlinear line of best fit) or a linear model.

    ggplot(customerChurn) + geom_point(aes(tenure, TotalCharges, color=Dependents), alpha=0.2) + 
      geom_smooth(aes(tenure, TotalCharges, color=Dependents), method=loess, se=FALSE) +       
      scale_x_continuous(breaks=seq(0,80,10)) +
      scale_y_continuous(breaks=seq(0,10000,1000), labels = paste("$", seq(0,10000,1000), sep="")) +
      scale_color_manual(name="Dependents?", values=c("orange", "blue"), labels=c("No", "Yes")) +
      theme(legend.position="top", legend.justification="left", legend.key.size = unit(0.5, "cm")) + 
      labs(x="Length of time as customer", y="Total Charges",
           title="Tenure vs Total Charges by Dependent", 
           caption="made with ggplot2");
  • And now with a linear model. This corresponds to a model that has an interaction between the numeric predictor and the categorical predictor.

    ggplot(customerChurn) + geom_point(aes(tenure, TotalCharges, color=Dependents), alpha=0.2) + 
      geom_smooth(aes(tenure, TotalCharges, color=Dependents), method=lm, se=FALSE) +       
      scale_x_continuous(breaks=seq(0,80,10)) +
      scale_y_continuous(breaks=seq(0,10000,1000), labels = paste("$", seq(0,10000,1000), sep="")) +
      scale_color_manual(name="Dependents?", values=c("orange", "blue"), labels=c("No", "Yes")) +
      theme(legend.position="top", legend.justification="left", legend.key.size = unit(0.5, "cm")) + 
      labs(x="Length of time as customer", y="Total Charges",
           title="Tenure vs Total Charges by Dependent", 
           caption="made with ggplot2");
  • Don’t like the grid lines and grey background?

    ggplot(customerChurn) + geom_point(aes(tenure, TotalCharges, color=Dependents), alpha=0.2) +
      geom_smooth(aes(tenure, TotalCharges, color=Dependents), method=lm, se=FALSE) +       
      scale_x_continuous(breaks=seq(0,80,10)) +
      scale_y_continuous(breaks=seq(0,10000,1000), labels = paste("$", seq(0,10000,1000), sep="")) +
      scale_color_manual(name="Dependents?", values=c("orange", "blue"), labels=c("No", "Yes")) +
      theme(legend.position="top", legend.justification="left", legend.key.size = unit(0.5, "cm"), 
            panel.grid.major = element_blank(), 
            panel.grid.minor = element_blank(),
            panel.background = element_blank()) + 
      labs(x="Length of time as customer", y="Total Charges",
           title="Tenure vs Total Charges by Dependent", 
           caption="made with ggplot2");

Barcharts

Barcharts are usually the best way to visualize categorical or discrete variables.

  • First make the ggplot function call and add the geom.

    ggplot(customerChurn) + geom_bar(aes(x=Partner));

    Display proportions by specifying the y aesthetic.

    ggplot(customerChurn) + geom_bar(aes(x=Partner, y=(..count..)/sum(..count..)));
  • Add a second categorical variable to make a stacked barchart.

    ggplot(customerChurn) + 
      geom_bar(aes(x=Contract, fill=Partner, y=..count../tapply(..count.., ..x.. ,sum)[..x..]));
  • Create a side-by-side barchart using the argument position

    ggplot(customerChurn) + 
      geom_bar(aes(x=Contract, fill=Partner, y=..count../tapply(..count.., ..x.. ,sum)[..x..]), position="dodge") +
      scale_fill_manual(name = "Partner", labels = c("No Partner", "Partner"), values=c("orange", "blue")) +
      theme(legend.position="top", legend.justification="left", legend.key.size = unit(0.5, "cm")) + 
      labs(x="Contract", y="") + 
      coord_flip();

Density Estimates

Histograms and density estimates allow you to visualize the distribution of a continuous variable or a discrete variable that takes on many values (e.g. Poisson). The geometric shape for a histogram is geom_histogram.

  • Use the function geom_density to make a density estimate.

    ggplot(coasters) + geom_density(aes(Age)); 
  • Coloring works much better than fill (aesthetics for histograms get muddled). This is where ggplot2 is superior to hist.

    ggplot(coasters) + geom_density(aes(Age, color=Type)); 
  • You can still facilitate subgroup comparison through paneling.

    ggplot(coasters) + 
      geom_density(aes(Age)) +
      facet_wrap(~Type, nrow=1);
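
A histogram version of the same idea can be sketched with the built-in faithful data (coasters lives in the class workspace); the binwidth is an illustrative choice.

```r
# geom_histogram bins the data itself; binwidth controls the bin size.
library(ggplot2);

pHist = ggplot(faithful) + geom_histogram(aes(waiting), binwidth=5);
pHist;
```

Color/fill groupings and facet_wrap work here just as they do with geom_density.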

Boxplots

Box-and-whisker plots show the distribution of a variable with a five number summary (the whisker extensions show the last value that is not an outlier) and indicate outliers (points).

ggplot(coasters) + 
  geom_boxplot(aes(Type, Age));
ggplot(coasters) + 
  geom_boxplot(aes(Age, Type));

Multiple Graphs

Suppose you want to show multiple graphs on the same graph sheet.

First, assign the plots you want to show to objects.

p1 = ggplot(coasters) + geom_density(aes(Age, color=Type)); 
p2 = ggplot(coasters) + geom_boxplot(aes(Age, Type));

Then, pass those objects to the function grid.arrange.

library(gridExtra);
grid.arrange(p1, p2, nrow=1);

Unlike base plots, the graphics parameter controlling the number of graphs on a sheet doesn’t have to be reset.

Time Series Plots

Time series plots are basically scatter plots where the x-variable is time. Often dates/times are represented as character strings (or factors). Convert characters to dates with the function as.Date. Inside ggplot2, use the functions scale_x_date and date_format (from the scales package).

Consider the data frame gasOil.

  • First, we have to convert the Date from a character string to a date.

    gasOil$Date2 = as.Date(gasOil$Date, format="%Y-%m-%d");
  • Create a basic time series graph.

    ggplot(gasOil) + 
      geom_line(aes(Date2, Price));
  • We can specify how the dates look on the axis.

    library(scales);
    
    ggplot(gasOil) + 
      geom_line(aes(Date2, Price)) + 
      scale_x_date(labels=date_format("%b %y"), date_breaks="5 years", limits = as.Date(c("1977-01-01","2013-01-01")));
  • Suppose we have some model, eg a quadratic model.

    time = 1:410;
    m1 = lm(Price~time + I(time^2), data=gasOil);
  • We can add fitted values to the plot.

    ggplot(gasOil, aes(Date2)) + 
        geom_line(aes(y=Price)) + 
        geom_line(aes(y=fitted(m1))) + 
        scale_x_date(labels=date_format("%b %y"), date_breaks="5 years", limits = as.Date(c("1977-01-01","2013-01-01")));
  • We can also specify our own coloring without a grouping variable (eg observed vs fitted).

    ggplot(gasOil, aes(x=Date2)) + 
        geom_line(aes(y=Price, color="Obs", linetype="Obs")) + 
        geom_line(aes(y=fitted(m1), color="Fit", linetype="Fit")) +
        scale_color_manual(name="", values = c("Obs"="blue", "Fit"="orange")) + 
        scale_linetype_manual("", values = c("Obs"="dashed", "Fit"="dotted")) + 
        theme(legend.position="top", legend.justification="left", legend.key.size = unit(0.5, "cm")) + 
        scale_x_date(labels=date_format("%b %y"), date_breaks="5 years", limits = as.Date(c("1977-01-01","2013-01-01"))) + 
        labs(x="", y="", title="Inflation adjusted gas prices with model fitted values");

Saratoga Home Prices III

The data frame saratoga in the workspace classData.Rdata contains the sales price, number of bedrooms, and whether homes have a fireplace. For all of these problems, do not alter the original data frame or create new ones. Use the plotting functionality to set the aesthetics and annotations.

  1. Make the below visualization.

    Change_Me(saratoga) + 
      geom_bar(Change_Me(Bedrooms, y = (..count..)/sum(..count..))) +
      scale_x_continuous(Change_Me=seq(1,7,1)) + 
      labs(x="", y="", Change_Me="Distribution of the number of bedrooms");

  2. Make the below visualization.

    ggplot(saratoga) + 
      geom_bar(aes(Bedrooms, fill=Change_Me, y=Change_Me)) +
      Change_Me(breaks=seq(1,7,1)) +
      scale_fill_manual(name = "Fireplace", Change_Me = c("No", "Yes"), Change_Me=c("orange", "blue")) +
      theme(legend.position="top", Change_Me="left", legend.key.size = unit(0.5, "cm")) + 
      labs(Change_Me="Number of bedrooms", y="", title="");

  3. Make the below visualization.

    ggplot(Change_Me) + 
      Change_Me(aes(Price)) + 
      scale_x_continuous(breaks=seq(0,900000,300000), Change_Me=paste(seq(0,900,300), "k", sep="")) +
      Change_Me(x="", y="", title="Price distribution");

  4. Make the below visualization.

    ggplot(saratoga) + 
      geom_density(aes(Price, color=Change_Me)) + 
      scale_x_continuous(Change_Me=seq(0,900000,300000), labels=paste(seq(0,900,300), "k", sep="")) +
      scale_color_manual(Change_Me = "Fireplace", labels = c("No", "Yes"), values=c("orange", "blue")) +
      theme(legend.position="top", legend.justification="left", legend.key.size = unit(0.5, "cm")) + 
      labs(x="", y="", Change_Me="Price distribution");

  5. Make the below visualization.

    ggplot(saratoga) + 
      Change_Me(aes(Change_Me, Fireplace)) + 
      Change_Me(breaks=seq(0,900000,300000), labels=paste(seq(0,900,300), "k", sep=""));

  6. Make the below visualization.

    ggplot(saratoga) + 
      Change_Me(aes(Living.Area, Price), Change_Me="blue") + 
      labs(x="Size", y="Price", title="");

  7. Make the below visualization.

    ggplot(saratoga) + 
      geom_point(aes(Living.Area, Price, color=Change_Me), Change_Me=0.25) + 
      Change_Me(aes(Living.Area, Price, color=Fireplace), Change_Me=lm, se=FALSE) + 
      Change_Me(name = "Fireplace", labels = c("No", "Yes"), values=c("orange", "blue")) +
      theme(legend.position="top", legend.justification="left", legend.key.size = unit(0.5, "cm")) + 
      labs(x="Size", y="Price", title="");

Homework #5

The workspace classData.Rdata contains the data frame concert. The data are the concession sales for a local concert venue. For each concert, there is data on the day of week (DOW), month, concession stand profit (Prof), concert attendance (Att), and a variable indicating if the concert took place on a weekend. For all of these problems, do not alter the original data frame or create new ones. Use the plotting functionality to set the aesthetics and annotations.

  1. Make the below visualization.

  2. Make the below visualization.

  3. Make the below visualization.

  4. Make the below visualization.

  5. Make the below visualization.

Homework #6

This homework is from ggplot2: Elegant Graphics for Data Analysis. The quiz will ask you to upload some visualizations.

  • 2.2.1: 2, 4

  • 2.3.1: 3

  • 2.4.1: 2

  • 2.5.1: 1, 2 (use cty for fuel economy and a linear model smoother, use your last name as the title), 3 (go through all arguments of the function)

  • 2.6.6: 2, 3, 4 (make density estimates)

Week 6

For Loops

The usual application of for loops is to iterate over the elements of an object using a sequence of integers to reference indices.

  • Vector operations reduce the need for loops

  • Use for loops for data management, merging complex data

  • Loop over the levels of one or more variables; the values iterated over can be anything, including factors/categories

  • You can use them for simulations

  • For loops are strongly preferred over while loops; do not use a while loop where a for loop will do

There are three parts to the statement for(iterator in values).

  • iterator: object that changes according to the values
  • in: operator
  • values: values to iterate over

Here are some demonstrative examples.

  • Iterate over the values of a categorical variable.

    unique(InsectSprays$spray);
    levels(InsectSprays$spray);

    Use a sequence of integers to access those values.

    for(i in 1:length(levels(InsectSprays$spray))){
      print(levels(InsectSprays$spray)[i]);
    }

    Use the values directly!

    for(i in levels(InsectSprays$spray)){
      print(i);
    }

    Use the iterator to subset the data and compute a statistic.

    out = NULL;
    for(i in levels(InsectSprays$spray)){
        dataTemp = InsectSprays$count[InsectSprays$spray==i];
        out = c(out, sum(dataTemp));
    }

    Another way

    out = vector(length=length(levels(InsectSprays$spray)), mode="numeric");
    counter = 1;
    for(i in levels(InsectSprays$spray)){
        dataTemp = InsectSprays$count[InsectSprays$spray==i];
        out[counter] = sum(dataTemp);
        counter = counter + 1;
    }

    Yet another way

    out = vector(length=length(levels(InsectSprays$spray)), mode="numeric");
    for(i in 1:length(levels(InsectSprays$spray))){
        dataTemp = InsectSprays$count[InsectSprays$spray==levels(InsectSprays$spray)[i]];
        out[i] = sum(dataTemp);
    }

    We don’t need a for loop for stuff like this though.

    tapply(InsectSprays$count, InsectSprays$spray, sum);

    We can also compute a vector of values for each level and store them in a matrix.

    out = NULL;
    for(i in levels(InsectSprays$spray)){
        dataTemp = InsectSprays$count[InsectSprays$spray==i];
        out = rbind(out, c(mean(dataTemp), sd(dataTemp)));
    }
    
    out = data.frame(levels(InsectSprays$spray), out);

    Another way

    out = matrix(NA, length(levels(InsectSprays$spray)), 2);
    for(i in 1:length(levels(InsectSprays$spray))){
        dataTemp = InsectSprays$count[InsectSprays$spray==levels(InsectSprays$spray)[i]];
        out[i, ] = c(mean(dataTemp), sd(dataTemp))
    }

    We don’t need a for loop for stuff like this though.

    data.frame(tapply(InsectSprays$count, InsectSprays$spray, mean),
               tapply(InsectSprays$count, InsectSprays$spray, sd));
  • How about we iterate over a boring sequence of integers.

    out = NULL;
    for(i in 1:length(InsectSprays$count)){
      out = c(out, InsectSprays$count[i]/9);
    }

    But we can achieve this through vectorized operations.

    InsectSprays$rate = InsectSprays$count/9;

If-Else Logic

Use if/then logic for handling different cases, function control and debugging/writing code.

Key logical operators are below.

  • ==: logical comparison

    x = c(FALSE, TRUE, FALSE);
    y = c(FALSE, TRUE, TRUE);
    
    x==y;
  • !: negate

    !x;
    !y;
  • &: and, all elements of a vector

    x&y;
  • &&: and, first element only (note: since R 4.3, the operands must have length one, so the call below errors)

    x&&y;
  • |: or, all elements of a vector

    x|y;
  • ||: or, first element only (same length-one requirement as && since R 4.3)

    x||y;

Note that when using else, the closing brace of the initial if (or preceding else) must be on the same line as the next else. Try the following.

X = 1;
if(X==0){
    print("No");
} else if(X==1) {
    print("Yes");
} else {
    print("I dunno!");
}

The condition passed to if must be a single logical value (since R 4.2, a condition of length greater than one is an error).

if(x){
    print("Yes");
} else {
    print("No");
}

if(x[1]){
    print("Yes");
} else {
    print("No");
}

Let’s combine our for loop and some logic to create a new categorical version of the count/area variable we created.

out = NULL;
for(i in 1:nrow(InsectSprays)){
  
  if(InsectSprays$rate[i]==0){
    temp = "none";
  }  else if( (0<InsectSprays$rate[i])&(InsectSprays$rate[i]<=1)) {
    temp = "ok";
  } else {
    temp = "good";
  }
  
  out = c(out, temp);
}

InsectSprays$result = out;

However, you can do this without a loop or if/else logic.

result2 = rep("good", times=nrow(InsectSprays));
result2[InsectSprays$rate==0] = "none";
result2[(0<InsectSprays$rate)&(InsectSprays$rate<=1)] = "ok";
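Another vector-based option is the function cut, which bins a numeric vector into intervals in one call. A minimal sketch (the name result3 is just for illustration, and the rate variable from earlier is recreated so the chunk stands alone):

```r
# recreate the rate variable from earlier in the notes
InsectSprays$rate = InsectSprays$count/9;

# cut() bins rate using interval breaks; the intervals are (-Inf,0], (0,1], (1,Inf]
result3 = cut(InsectSprays$rate, breaks=c(-Inf, 0, 1, Inf),
              labels=c("none", "ok", "good"));
table(result3);
```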

While Loops

While loops and for loops appear to be very similar because they both can be used for iteration.

  • The input to while better resembles the input to if than the input to for

  • While loops continue operations while some logical condition is TRUE and stop when that condition is FALSE

  • While loops are used when the number of iterations is not known in advance, as with many numerical analysis or estimation methods

Let’s compute a mean until it is sufficiently stable.

meanOut = faithful$waiting[1];
tol = 0.0001;
adif = 1;
counter = 2;
while(adif > tol){
  meanTemp = mean(faithful$waiting[1:counter]);
  adif = abs(meanTemp - meanOut);
  counter = counter + 1;
  meanOut = meanTemp;
}
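A quick check of the result: compare the stabilized mean (and the number of observations it used) against the mean of the full sample. The loop is rerun here so the chunk stands alone.

```r
# rerun the stabilization loop
meanOut = faithful$waiting[1];
tol = 0.0001;
adif = 1;
counter = 2;
while(adif > tol){
  meanTemp = mean(faithful$waiting[1:counter]);
  adif = abs(meanTemp - meanOut);
  counter = counter + 1;
  meanOut = meanTemp;
}

meanOut;                  # the stabilized mean
counter - 1;              # number of observations used
mean(faithful$waiting);   # mean of all the waiting times, for comparison
```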

Functions

Functions do not always have to be large containers for multiple complex tasks.

  • Good for small tasks that are repeated

  • Make code less repetitive

Writing a large, complex function can be daunting.

  • Start with smaller functions for individual tasks

  • Make them work together to achieve larger tasks

The function function is used to assign code to an object that can be called and executed.

  • A function definition must be executed (e.g., through the console) before the function exists as an object in the workspace

  • You should comment the beginning of each function with a description of the function, the inputs, and the outputs (i.e., a header)

  • Arguments are specified inside the function definition along with default values.

Below is a basic function with two arguments.

myFun = function(arg1 = 10, arg2 = 20, ...){
  arg1 + 1;
  arg2 + 1;
}
myFun();
myFun(arg2=200);
arg1;
arg2;

What happens when you use <- assignment to define default values for arguments?

myFun = function(arg1 <- 1){
    arg1 + 1;
}
myFun();

Suppose an object is defined outside the function and not defined inside the function, but the function uses the object. Where does the definition of the object come from?

x = 1;  
myFun = function(){
    x = x+2;
    return(x);
}
myFun();

Suppose an object is defined both inside and outside the function. Which definition does it use?

x = 1;  
myFun = function(){
    x = 2;
    x + 2;
}
myFun();
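A sketch that makes the answer visible: the assignment inside the function creates a local copy, so the global object is untouched (the function name shadowFun is just for illustration).

```r
x = 1;
shadowFun = function(){
    x = 2;      # local x shadows the global x
    x + 2;      # the last expression is the return value
}
shadowFun();    # uses the local x, returns 4
x;              # the global x is still 1
```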

Functionality Practice

  1. The data frame iris is in the library datasets. Use a for loop to iterate through the unique species. For each species, find the mean of the variables (sepal lengths, sepal widths, petal lengths and petal widths). Create a data frame that contains the means and a vector of the unique species. Give the columns meaningful names. Then display the structure of the data frame you created.

  2. Consider the data frame faithful and the variable waiting. Use a for loop to create a categorical variable called waitingCat defined by the below. Do not add this variable to the original data frame.

    • Quick: waiting time less than 45 minutes

    • Moderate: waiting time in between 45 and 82 minutes (inclusive)

    • Long: waiting time longer than 82 minutes

    After you create the variable, obtain the frequencies of each level.

  3. Recall that the function boxplot will return an object that identifies the observations that are considered outliers. Write a function that uses the function boxplot to identify and return the outliers in the input vector X. Set the boxplot specification so that no plot is rendered by default. Use your function to identify the outlying prices in the data frame legos in classData.Rdata.

  4. Recall part 7 from Homework #3. There you used tidy functionality to find the correlations by two categorical variables. Use two for loops to iterate over the levels of the categorical variables to compute and store the correlations. Your final data frame should contain the levels of the categorical variables and the correlations. Name the variables and display the structure of the data frame you create.

  5. The zip file Earthquakes 2000-2020.zip contains files on earthquakes from 2000 to 2020. Your objective is to use a for loop to iterate over the file names, summarize each file, and build a data frame with the summaries. Let’s break down the process as follows.

    1. Download and unzip the zip file. Obtain the directory and set your working directory to be this location.

    2. Read the first file in the directory, query2000.csv. Determine the number of rows and the mean/sd of the magnitudes.

    3. Use the function dir to determine the file names in the directory. Assign this to an object.

    4. Use a for loop to iterate over the file names. Read each file and determine the summary statistics you computed previously. Build an object that stores the summary statistics for each file (year). After reading/processing all of the files, add a variable for the year. Name each variable and display the structure of the data. Finally, change the directory out of the folder with the files to the parent location.

Homework #7

  1. Use iteration to compute and store the mean and standard deviation of every variable in the mtcars data frame contained in the library datasets. Name the rows/columns of the object you create and then display the object.

  2. Use iteration and logic to create a new factor variable from the horsepower variable in the mtcars data frame contained in the library datasets, defined by the following: low (less than 125), medium (between 125 and 250), high (above 250). Display a table of the new horsepower variable.

  3. Convert the nursery rhyme “ten in the bed” to a function. Generalize it to any number of people in any sleeping structure. Call your function with 7 people in a twin bed. Hint: use the function cat to print out the message.

  4. The zip file specdata.zip contains 332 csv files corresponding to particulate matter air pollution monitors in the United States. Each file contains data from a single monitor and the ID number for each monitor is the file name. For example, data for monitor 200 is contained in the file 200.csv. In each file, missing values are denoted as NA. Each file contains three variables described below.

    • Date: the date of the observation in YYYY-MM-DD format (year-month-day)

    • sulfate: the level of sulfate PM (measured in micrograms per cubic meter)

    • nitrate: the level of nitrate PM (measured in micrograms per cubic meter)

    Download the zip file specdata.zip. Unzip the file which should create a directory named specdata that contains the files. Do not make any modifications to the files in the directory and do not save any other files in the directory.

    1. Create an R script for your analysis in a directory other than the specdata directory. Determine the directory path for these two directories. Change the directory to the location of specdata.

    2. Read the file 001.csv for monitor 001.

    3. How many observations are in the data frame? Remove any rows with at least one missing value. How many observations had missing values?

    4. What is the correlation between the nitrate and sulfate levels?

    5. Use the function lm to fit a linear model that uses the sulfate level to predict the nitrate level. Use the function coef to extract the model coefficients.

    6. Write a function called getMonitor with the below inputs and outputs. Check to make sure the number of complete cases is at least two before estimating the correlation and regression coefficients. If there are not enough complete cases, return a vector of NA values for the correlation and regression coefficients.

      • Input: the filename

      • Output: a vector containing the number of complete cases, number of incomplete cases, correlation (if applicable), and the regression coefficients (if applicable)

    7. Test your function using the data for monitor 002.

    8. When you are in the specdata directory, use the function dir to determine the filenames in the directory. This gives a vector of character strings for the filenames.

    9. Iterate through the file names and use your function to read/process each file. Store the output returned by the function in each iteration. Create a data frame that contains the station filenames and output. Change the row and column names.

    10. Use your file summaries to obtain each of the below.

      • Numerical summaries of the variables

      • Density estimates for the number of complete cases, number of missing values, correlations, intercepts and slopes

      • Plot of the number of complete cases vs correlation

Week 7

Character Data

Below is a summary of helpful functions for managing character data.

  • nchar: finds the number of characters in a string
  • substring: extracts a substring from a string defined by a starting and end location
  • tolower/toupper: converts upper/lower case to lower/upper case
  • sub: replaces the first instance of a specified string with a replacement string
  • gsub: replaces all instances of a specified string with a replacement string
  • strsplit: splits a character string using a specified character value
  • unlist: converts the elements of a list to a vector

Below is a vector of character strings to play with.

phrases = c("Statistical Computing With R", "STAT 331", "stat", "Stat", "SSSTAT", "STAT STAT", "STATISTICS", "805.756.1111", "#805", "800-123-0805", "I like teaching STAT 331");
  • Use the function nchar to determine the number of characters in a given string.

    nchar(phrases[1]);
    nchar(phrases);
  • Use the function substring to extract the first four characters in a given string.

    substring(phrases[1], 1, 4);
    substring(phrases, 1, 4);
  • The function sub only replaces the first instance of each string. Use the function gsub to replace all instances of the string “STAT” with “DATA” in the vector phrases.

    phrases[6];
    
    sub("STAT", "DATA", phrases[6]);
    gsub("STAT", "DATA", phrases[6]);
    
    sub("STAT", "DATA", phrases);
    gsub("STAT", "DATA", phrases);
    
    gsub("STAT", "DATA", phrases, ignore.case = TRUE);
  • Use the function strsplit to split the character strings using a " " (space) split character.

    strsplit(phrases[1], split=" ");
    
    strsplit(phrases[1], split=" ")[[1]];
    unlist(strsplit(phrases[1], split=" "));

Regular Expressions

Below is a description of the functions for identifying patterns.

  • grepl: gives logical indicators of a matching pattern
  • grep: gives the location of a matching pattern
  • regexpr: gives the location of the first matched pattern, also gives the length of the matched pattern
  • gregexpr: returns the same information as regexpr but in a list form, shows the location of each match and the length of the match
  • charmatch and pmatch: used for matching and partial matching

Below is a description of operators and character classes.

  • \(\left[ \right]\): forms a character class to be searched
  • \(\left[a-z\right]\): lower-case letters
  • \(\left[A-Z\right]\): upper-case letters
  • \(\left[A-z\right]\): both (note: this ASCII range also matches a few punctuation characters; \(\left[A-Za-z\right]\) is safer)
  • \(\left[0-9\right]\): digits
  • \(\left[A-z0-9\right]\): alphanumeric
  • (): groups patterns
  • |: or operator
  • \: escapes the special meaning of any metacharacters

For some characters to be interpreted literally, you must escape them so they are not treated as pattern metacharacters. Escaping produces an escape sequence to reference. Some common escape sequences are listed below.

  • \.: period
  • \': a single quote
  • \n: a new line
  • \r: carriage return
  • \t: tab

Positioning specifications are listed below.

  • ^: matches to the beginning of a string
  • $: matches to the end of a string
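A small sketch combining escapes and anchors on a few toy strings:

```r
x = c("805.756.1111", "805-756-1111", "file.txt");

grepl("\\.", x);     # a literal period anywhere in the string
grepl("^805", x);    # "805" anchored to the beginning
grepl("txt$", x);    # "txt" anchored to the end
```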

Quantifiers that specify repetition of patterns are listed below.

  • *: matches 0 or more occurrences of preceding entity
  • ?: matches 0 or 1 occurrences of preceding entity
  • +: matches 1 or more occurrences of preceding entity
  • {n}: matches exactly n occurrences of preceding entity
  • {n,}: matches at least n occurrences of preceding entity
  • {n,m}: matches between n and m occurrences of preceding entity
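The quantifiers can be checked against a few toy strings:

```r
words = c("ct", "cat", "caat", "caaat");

grepl("ca*t", words);    # 0 or more a's: matches all four
grepl("ca?t", words);    # 0 or 1 a: matches the first two
grepl("ca+t", words);    # 1 or more a's: matches the last three
grepl("ca{2}t", words);  # exactly 2 a's: matches "caat" only
grepl("ca{2,}t", words); # at least 2 a's: matches the last two
```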

Below is a vector of character strings to play with.

phrases = c("Statistical Computing With R", "STAT 331", "stat", "Stat", "SSSTAT", "STAT STAT", "STATISTICS", "805.756.1111", "#805", "800-123-0805", "I like teaching STAT 331");
  • Use the function grepl to search for instances of the phrase “STAT” in the phrases.

    grepl("STAT", phrases);
    phrases[grepl("STAT", phrases)];
    
    grepl("STAT", phrases, ignore.case=TRUE);
    phrases[grepl("STAT", phrases, ignore.case=TRUE)];
  • The other functions give increasingly more detailed output: grep, regexpr and gregexpr.

    grep("STAT", phrases);
    regexpr("STAT", phrases);
    gregexpr("STAT", phrases);
  • Use the function grep to search for instances of “STAT” anchored to the beginning of the target strings.

    phrases[grep("^STAT", phrases)];
  • Search for instances of “STAT” followed by a space then any number.

    phrases[grep("STAT [0-9]", phrases)];
  • Search for instances of three numbers followed by a period or dash.

    phrases[grep("[0-9][0-9][0-9](\\.|-)", phrases)];

Web Scraping

If the data are tabular and well-structured, it is easy enough to use the read.TYPE functions to read the data over a network.

A more brute-force approach is to use the function readLines.

  • This literally reads data line by line (that you later have to decode and manage)
  • It can also be used to parse pure XML or HTML code (that you later have to decode and manage)

link = "https://www.wunderground.com/hurricane/archive/AL/2020/Hurricane-Laura/AL132020";
page = readLines(link);

More tame methods use packages that do decoding and look for tables. Suppose you want tracking data on Hurricane Laura.

# one way
library(httr);
library(XML);
link = "https://www.wunderground.com/hurricane/archive/AL/2020/Hurricane-Laura/AL132020";
page = GET(link);
data = readHTMLTable(doc=content(page,"text"));
data = data[[1]];

# another way
library(rvest);
link = "https://www.wunderground.com/hurricane/archive/AL/2020/Hurricane-Laura/AL132020"
data = read_html(link);
data = html_table(data, fill=TRUE);
data = data[[1]];

# another way but some windows users have trouble with this
library(RCurl);
library(XML);
link = "https://www.wunderground.com/hurricane/archive/AL/2020/Hurricane-Laura/AL132020"
page = getURL(link);
data = readHTMLTable(page);
data = data[[1]];

Maps

The primary function for creating maps with base functionality is the function map in the library maps. The function plots polygons from a database, which is a collection of polygons that each define a specific region. Commonly used arguments are below.

  • database: world, usa, state, county
  • regions: specify specific regions
  • fill: specify TRUE if filling in regions with color
  • col: colors to fill in the regions
  • xlim, ylim: set the longitude and latitude limits

You can use the functions points or text to add points or text to a map by specifying the longitude (x) and latitude (y).

  • Load the library and make a map of the world.

    library(maps);
    map("world");
  • Make a map of the USA.

    map("usa");
  • Make a map of the USA with state borders.

    map("state");
  • The data frame us.cities contains some information on major US cities. Make a plot of the USA with these locations.

    map("state");
    points(us.cities$long, us.cities$lat, pch=20);
  • Suppose you want California, you need to know the region(s).

    mapInfo = map("state", plot=FALSE);
    str(mapInfo);
    mapInfo$names;
  • Identify and specify the name of the region.

    map("state", regions="california");
  • Make a map of the California counties.

    map("county");
    map("county", regions="california");
  • Make a map of the track of Laura.

    # read data
    library(httr);
    library(XML);
    library(maps);
    link = "https://www.wunderground.com/hurricane/archive/AL/2020/Hurricane-Laura/AL132020"
    page = GET(link);
    data = readHTMLTable(doc=content(page,"text"));
    data = data[[1]];
    
    # get rid of spaces in names
    names(data) = gsub(" ", "", names(data));
    
    # convert to numeric
    data$Lat = as.numeric(data$Lat);
    data$Lon = as.numeric(data$Lon);
    
    # create some categories
    data$StormType[grep("Category 1", data$StormType)] = "Category 1";
    data$StormType[grep("Category 2", data$StormType)] = "Category 2";
    data$StormType[grep("Category 3", data$StormType)] = "Category 3";
    data$StormType[grep("Category 4", data$StormType)] = "Category 4";
    data$StormType[grep("Tropical", data$StormType)] = "Other";    
    
    # now give them some color
    myCol = rep("grey", times=nrow(data));
    myCol[data$StormType=="Category 1"] = "lightblue";
    myCol[data$StormType=="Category 2"] = "blue";
    myCol[data$StormType=="Category 3"] = "darkblue";
    myCol[data$StormType=="Category 4"] = "orange";
    
    # now make a map
    map("world", xlim=range(data$Lon)+c(-30, 30), ylim=range(data$Lat)+c(-10, 10));
    points(data$Lon, data$Lat, pch=20, col=myCol);
    mtext(text=link, side=1);
    legend(x="topright", legend=c("Other", "Category 1", "Category 2", 
                                  "Category 3", "Category 4"), 
           col=c("grey", "lightblue", "blue",
                 "darkblue", "orange"), pch=20);

Election Results

I found some data on the election results by county in California.

# load packages;
library(maps);
library(httr);
library(XML);

# get the data
link = "https://en.wikipedia.org/wiki/2020_United_States_presidential_election_in_California";
page = GET(link);
data = readHTMLTable(doc=content(page,"text"));

# look through the data structure to identify the table
data = data[[79]];

# look through the rows and columns to identify the ones we need
head(data);
tail(data);

# subset this to get rid of rows we don't want, keep columns we do want
data = data[-c(1,2), c(1,4)];

# manage the data a bit
names(data) = c("County", "PCT");
data$PCT = as.numeric(gsub("%", "", data$PCT));

# make the basic map, get the region names
# notice the region names in the map are 1-to-1 with the names in the data
a = map("county", "california", plot=FALSE);

# give regions in our DF colors
data$col = "white";
data$col[(data$PCT>=30)&(data$PCT<40)] = "grey";
data$col[(data$PCT>=40)&(data$PCT<50)] = "lightblue";
data$col[(data$PCT>=50)&(data$PCT<60)] = "blue";
data$col[data$PCT>=60] = "darkblue";

# commit an act of journalism
map("county", "california", fill=TRUE, col=data$col);
title(main="Percentage of Trump votes in the 2020 election",
      xlab="https://en.wikipedia.org/wiki/2020_United_States_presidential_election_in_California");
legend("topright", 
       legend=c(">= 60%", "[50%,60%)", "[40%,50%)", "[30%,40%)", "<30%"), 
       col=c("darkblue", "blue", "lightblue", "grey", "white"), 
       pch=c(15, 15, 15, 15, 15));

# if the names are not 1-to-1 you could do the below
myCol = rep("white", times=length(a$names));
for(i in 1:nrow(data))
{
  myCol[grep(data$County[i], a$names, ignore.case = TRUE)] = data$col[i];
}

map("county", "california", fill=TRUE, col=myCol);
title(main="Percentage of Trump votes in the 2020 election",
      xlab="https://en.wikipedia.org/wiki/2020_United_States_presidential_election_in_California");
legend("topright", 
       legend=c(">= 60%", "[50%,60%)", "[40%,50%)", "[30%,40%)", "<30%"), 
       col=c("darkblue", "blue", "lightblue", "grey", "white"), 
       pch=c(15, 15, 15, 15, 15));

Act of Journalism Practice

  1. Gasoline sold in California has to be produced specifically for California. I was curious about how much of California’s oil production comes from each county. I found some data.

    # load some packages
    library(httr);
    library(XML);
    library(ggplot2);
    
    # get some data 
    link = "http://www.drillingedge.com/california";
    page = Change_Me(link);
    data = Change_Me(doc=content(page,"text"));
    
    # combine the two data sources, subset the columns and rename
    data = rbind(data[[Change_Me]], data[[Change_Me]]);
    data = data[,1:2];
    names(data) = c("county", "oil");
    
    # manage the data a bit
    data$county = gsub(" County, CA", "", data$county);
    data$oil = Change_Me(" BBLs", "", data$oil);
    data$oil = as.numeric(gsub(",", "", data$oil));
    
    # create some percentages
    data$oilp = data$oil / Change_Me(data$oil)*100;
    data$oilpt = Change_Me(round(data$oilp, digits=2), "%", sep="");
    
    ### for plotting: subset the data to only have the top 10
    pdata = data[order(data$oilp, Change_Me=TRUE)[1:Change_Me],];
    pdata$county = factor(pdata$county);
    
    # make a plot using the subset
    ggplot(Change_Me) + 
      geom_bar(aes(x=Change_Me(county, oil), y=Change_Me), stat="identity") +
      scale_y_continuous(breaks=seq(0,10000000, 8000000), labels = scales::comma, Change_Me=c(0, 8000000)) + 
      Change_Me_text(aes(x=reorder(county, oil), y=oil, label=Change_Me), hjust=-0.05, size=4) + 
      Change_Me(x="", y="Total Oil Production (BBLs)", title="Top 10 Oil Producing Counties in California", 
                subtitle = paste("All other counties account for ", Change_Me(100-sum(pdata$oilp), digits=2), "% of total production", sep=""), caption = "http://www.drillingedge.com/california") +
      coord_flip();
  2. Now use the same data to make a map.

    # load some packages
    library(httr);
    library(XML);
    library(ggplot2);
    library(maps);
    
    # get some data 
    link = "http://www.drillingedge.com/california";
    page = GET(link);
    data = readHTMLTable(doc=content(page,"text"));
    
    # combine the two data sources, subset the columns and rename
    data = rbind(data[[4]], data[[5]]);
    data = data[,1:2];
    names(data) = c("county", "oil");
    
    # manage the data a bit
    data$county = gsub(" County, CA", "", data$county);
    data$oil = gsub(" BBLs", "", data$oil);
    data$oil = as.numeric(gsub(",", "", data$oil));
    
    # create some percentages
    data$oilp = data$oil / sum(data$oil)*100;
    
    ### for plotting: give each county in the data frame a color
    data$col = "white";
    data$col[data$oilp>=10] = "darkblue";
    data$col[(data$oilp>=5)&(data$oilp<10)] = "blue";
    data$col[(data$oilp>=1)&(data$oilp<5)] = "lightblue";
    data$col[(data$oilp>0)&(data$oilp<1)] = "grey";
    
    # create a vector of colors for each region in the map
    a = map("county", "california", plot=FALSE);
    myCol = rep("white", times=length(a$names));
    for(i in 1:nrow(data))
    {
      myCol[grep(data$county[i], a$names, ignore.case = TRUE)] = data$col[i];
    }
    
    map("county", "california", fill=TRUE, col=myCol);
    title(main="California Oil Production by County", 
          xlab="http://www.drillingedge.com/california");
    legend("topright", 
           legend=c(">= 10%", "[5%,10%)", "[1%,5%)", "(0%,1%)", "0%"), 
           col=c("darkblue", "blue", "lightblue", "grey", "white"), 
           pch=c(15, 15, 15, 15, 15));

Homework #8

  1. What is happening with the field production of CA crude oil? I found some data. Web scrape (directly) and make the below visualization.

  2. How much domestic oil comes from the lower 48? I found some data. You can ignore the rows for the U.S. total, PADDs, Federal Offshore PADDs, and Alaska/South Alaska/North Slope. Compute the percentage of oil from each state produced in the lower 48. The colors used here are white, grey, lightblue, blue and darkblue. Web scrape (directly) and make the below visualization.

  3. Where does the gas made in California come from? I found some data. Web scrape (directly) these data. Here, we want to focus on the refineries that make CARB gasoline (not diesel). Among these refineries, determine the percentage of BPD that each processes. The file mapLocations.csv contains the location of cities in CA. Create a map of CA that shows the location of each refinery.

The Binomial Distribution

The Binomial distribution begins with Bernoulli trials.

\(x = \left\{\begin{array}{cc} 1 & \text{"success"} \\ 0 & \text{"failure"} \\ \end{array}\right.\)

The PMF for X is fairly simple.

  • \(P(X=0) = 1-p\)
  • \(P(X=1) = p\)

Suppose you have \(n\) Bernoulli trials \(X_{1}, \dots, X_{n}\) that are independent each with the same probability of success \(p\).

A Binomial random variable, \(X\), is the number of successes in the \(n\) trials and the sum of the Bernoulli trials, \(X= \displaystyle \sum_{i=1}^{n} X_{i}\).

The PMF is given by the below.

\(P(X=x) = {n \choose x} p^{x} (1-p)^{n-x}\) for \(x=0,1, \dots ,n\)

The R functionality is below.

  • dbinom: the PMF, \(P(X=x)\)

  • pbinom: the CDF, \(P(X \leq x)\)

  • rbinom: random number generation
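As a sanity check, dbinom agrees term by term with the PMF formula (the values of n, p, and x below are arbitrary):

```r
n = 50; p = 0.40; x = 20;

# the formula: choose(n, x) * p^x * (1-p)^(n-x)
choose(n, x) * p^x * (1-p)^(n-x);
dbinom(x, size=n, prob=p);
```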

Example: In the United States, it is reported that 40% of adults have college degrees. Suppose 50 adults are randomly selected and we are interested in the number of those with college degrees.

  1. Obtain the PMF.

    x = 0:50;
    dbinom(x, size=50, prob=0.40);
  2. Make a barplot of the PMF.

    pmfX = dbinom(x, 50, 0.40);
    barplot(pmfX, names.arg=x, xlab="X", ylab="PMF", main="");
  3. What is the probability none of the adults have a college degree?

    dbinom(0, 50, 0.40);
  4. What is the probability no more than twenty adults have a college degree?

    pbinom(20, 50, 0.40);
  5. What is the probability that between twenty and thirty adults (inclusive) have a college degree?

    pbinom(30, 50, 0.40) - pbinom(19, 50, 0.40);

The Poisson Distribution

The Binomial model is for the number of successes in \(n\) fixed trials. The Poisson distribution is for the number of “successes” within a fixed interval of time or space, and may be any nonnegative integer \(0, 1, \dots\).

  • The number of successes in nonoverlapping intervals of time or space are independent of one another

  • The probability of a success is the same for each interval

The PMF can be derived as a limit of the Binomial distribution.

\(P(X=x) = \frac{\lambda^{x} e^{-\lambda}}{x!}\) for \(x=0, 1, \dots\)

The R functionality is below.

  • dpois: the PMF, \(P(X=x)\)

  • ppois: the CDF, \(P(X \leq x)\)

  • rpois: random number generation
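As with the Binomial, dpois agrees with the PMF formula (the values of lambda and x below are arbitrary):

```r
lambda = 3; x = 5;

# the formula: lambda^x * exp(-lambda) / x!
lambda^x * exp(-lambda) / factorial(x);
dpois(x, lambda);
```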

Example: The manager of a Starbucks believes that the typical Starbucks customer averages 3 visits to the store over a 5-day period.

  1. Obtain the PMF.

    x=0:15;
    dpois(x, lambda=3);
  2. Make a barplot of the PMF.

    pmfX = dpois(0:15, 3);
    barplot(pmfX, names.arg=x, xlab="X", ylab="PMF", main="");
  3. What is the probability that a customer visits the chain five times in a 5-day period?

    dpois(5, 3);
  4. What is the probability that a customer visits the chain no more than two times in a 5-day period?

    ppois(2, 3);
  5. What is the probability that a customer visits the chain at least three times in a 5-day period?

    1-ppois(2, 3);

The Normal Distribution

The Normal distribution is a continuous distribution: it allows any real value, so there are infinitely many possible values.

The PDF is given by the below.

\(f(x) = \frac{1}{\sigma \sqrt{2 \pi}} e^{ \frac{-1}{2 \sigma^{2}}(x-\mu)^{2}}\) for any real \(x\)

The R functionality is below.

  • dnorm: the PDF, \(f(x)\)

  • pnorm: the CDF, \(P(X \leq x)\)

  • qnorm: quantiles/percentiles \(x_{p}\) so that \(P(X \leq x_{p}) = p\)

  • rnorm: random number generation

Example: The owners of a taco truck believe the amount of beef used in a day has a Normal distribution with mean of 24 pounds and a standard deviation of 5 pounds.

  1. Make a graph of the distribution.

    x = rnorm(n=1000, mean=24, sd=5);
    xPdf = dnorm(x, 24, 5);
    plot(x, xPdf);
  2. Let's try to make a line.

    plot(x, xPdf, type="l");
  3. R connects the points in the order of observation; it does not sort the x-values and put the y-values in the order of the sorted x-values.

    plot(sort(x), xPdf[order(x)], type="l");
  4. Create a sequence of values to make the graph.

    x = seq(9, 39, .01);
    xPdf = dnorm(x, 24, 5);
    plot(x, xPdf, type='l');
  5. What is the probability that the amount of beef used is less than 10.25 pounds?

    pnorm(10.25, 24, 5);
  6. What is the probability that the amount of beef used is more than 10.25 pounds?

    1-pnorm(10.25, 24, 5);
  7. What is the probability the amount of beef used is between 24 and 30 pounds?

    pnorm(30, 24, 5) - pnorm(24, 24, 5);
  8. The bottom 10% of beef usage is less than what amount?

    qnorm(0.10, 24, 5);
  9. The top 10% of beef usage is more than what amount?

    qnorm(0.90, 24, 5);
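
As an aside, we can use pnorm on the taco-truck distribution to verify the familiar empirical rule: about 68% of values fall within one standard deviation of the mean.

```r
# P(19 < X < 29) for X ~ Normal(mean=24, sd=5), i.e. within one sd of the mean
pnorm(29, 24, 5) - pnorm(19, 24, 5);
```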

The Exponential Distribution

The Exponential distribution is for the length of time between successes.

The PDF is given by the below.

\(f(x) = \lambda e^{-\lambda x}\) for \(x \geq 0\)

The R functionality is below.

  • dexp: the PDF, \(f(x)\)

  • pexp: the CDF, \(P(X \leq x)\)

  • qexp: quantiles/percentiles \(x_{p}\) so that \(P(X \leq x_{p}) = p\)

  • rexp: random number generation
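
As with the other distributions, dexp simply evaluates the PDF formula above; a quick check (no data assumed):

```r
# Check dexp against the PDF formula f(x) = lambda * exp(-lambda * x)
lambda = 0.20;
x = c(0, 1, 5, 10);
byHand = lambda * exp(-lambda * x);
all.equal(byHand, dexp(x, rate=lambda));
```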

Example: Let the time between e-mail messages during work hours be Exponentially distributed with a rate of 1 every five minutes.

  1. Make a plot of the PDF.

    x = seq(0, 20, .1);
    pdfX = dexp(x=x, rate=0.20);
    plot(x, pdfX, type="l");
  2. What is the probability you will get an email within 10 minutes?

    pexp(10, 0.20);
  3. What is the probability you will not get an email for more than 20 minutes?

    1 - pexp(20, 0.20);
  4. What is the 90th percentile of the arrival times?

    qexp(0.90, 0.20);

Probability Practice

  1. A construction company in Florida is selling condominiums. Assume the prices have a Normal distribution with a mean of $510,000 and standard deviation of $10,000.

    1. Make a graph of the distribution of the sales prices.

    2. What is the probability a condo sells for more than $515,000?

    3. What is the probability a condo sells for between $490,000 and $500,000?

    4. What is the probability a condo sells for exactly $510,000?

    5. The bottom 5% of sales prices are below what value?

  2. According to a survey by Transamerica Center for Health Studies, 15% of Americans still have no health insurance. Suppose five individuals are randomly selected.

    1. Make a graph of the distribution of the number of individuals with no health insurance.

    2. What is the probability that all five have health insurance?

    3. What is the probability that no more than two have health insurance?

    4. What is the probability that at least four have health insurance?

  3. Suppose the average price of electricity for a New England customer follows the continuous uniform distribution with a lower bound of 12 cents per kilowatt-hour and an upper bound of 20 cents per kilowatt-hour. What is the probability a price is more than 14 cents per kilowatt-hour? Hint: use the function punif.

Week 8

Simulation

  • We can simulate the distribution of strange random variables, e.g. Bin(2, 0.5) + Bin(3, 1/3) + Bin(4, 1/4).

    myBin = rbinom(1000, 2, 0.5) + rbinom(1000, 3, 1/3) + rbinom(1000, 4, 1/4);
    
    prop.table(table(myBin));
    mean(myBin);
    sd(myBin);

    We can use this to answer specific probability questions, such as \(P(myBin > 2)\).

    sum(myBin > 2) / 1000;
  • Another useful function is the function sample which allows you to sample with or without replacement from an input vector.

    sample(1:6, 6, replace=TRUE);
    sample(1:6, 6, replace=TRUE, prob=c(0.01, 0.01, 0.01, 0.01, 0.01, 0.95));
    
    sample(1:6, 6, replace=FALSE);
  • Example: Suppose there is a group of 50 people. Simulate the approximate probability at least two people have the same birthday (ie the same day of the year, not considering year of birth; also ignore leap years). Use 10000 simulations.

      simNum = 10000;
      sameBeeDay = NULL;
      for(i in 1:simNum){
          bDays = sample(1:365, size=50, replace=TRUE);
          sameBeeDay = c(sameBeeDay, sum(table(bDays)>=2)>0);
      }
      sum(sameBeeDay) / simNum;
  • Example: Consider a standard deck of cards. Here we will assume the numeric values of the cards range from 1 (Ace) to 13 (King). What is the probability of getting five cards of the same suit (ie a flush)? Use 10000 simulations.

    face = rep(c("Ace", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine", "Ten",
                 "Jack", "Queen", "King"), times=4);
    suit = rep(c("Heart", "Diamond", "Club", "Spade"), each=13);
    value = rep(1:13, times=4);
    color = rep(c("Red", "Black"), each=26);
    deck = data.frame(face, suit, value, color);
    
    flush = NULL;
    simNum = 10000;
    for(i in 1:simNum){
        cardNo = sample(1:52, size=5, replace=FALSE);
        cards = deck[cardNo,];
        flush = c(flush, length(unique(cards$suit))==1);
    }
    mean(flush);
  • Example: Perhaps you have seen the Random Babies applet? Suppose one night at a hospital four mothers each give birth to a baby. The hospital is not very organized and loses track of which mother gave birth to each baby, and essentially returns the babies to the mothers at random. Here, we are interested in the number of babies that are correctly returned to their respective mothers. Simulate the distribution of the number of babies that are correctly returned. Use 10000 simulations. Create a barchart showing the simulated distribution.

    babyMama = 1:4;
    numCorrect = NULL;
    simNum = 10000;
    for(simItor in 1:simNum)
    {
        returnedBaby = sample(babyMama, 4, replace=F);
        numCorrect = c(numCorrect, sum(babyMama == returnedBaby));
    }
    
    myTab = prop.table(table(numCorrect));
    # exactly three babies correct is impossible (the fourth would then also be correct),
    # so insert a 0 for that category
    myTab = c(myTab[1:3], 0, myTab[4]);
    names(myTab)[4] = "3";
    
    barplot(myTab);

Simulation Practice

  1. A professor has written an exam. Fifty of the questions are multiple choice with five choices each, and the remaining 50 questions are true/false. You are not prepared for the exam and are going to guess on each of the questions. You get three points for each correct multiple choice question and two points for each correct true/false question. There is a 5% chance you will get 5 bonus points. Simulate 10000 scores of the exam.

    1. Approximately, what is the mean and standard deviation of the scores on the exam?

    2. What is the approximate probability someone receives a score of 100 points or better?

    3. Create a histogram of the simulated scores.

  2. Consider a standard deck of cards. Here we will assume the numeric values of the cards range from 1 (Ace) to 13 (King).

    1. Randomly sample two cards from the deck without replacement.

    2. Determine if the sum of the two cards is 19 or more.

    3. Simulate 10000 draws of two cards and their resulting two-card sums.

    4. Make a histogram of the sum values.

    5. What is the approximate probability the sum of the values is 19 or more?

  3. Data Science Interview Question: There are two boxes. One box has 10 black cards and 10 red cards. The second box has 100 black cards and 100 red cards. Suppose you sample 2 cards without replacement from each box. Which box has the higher probability of drawing two cards with the same color? Use 10000 simulations to compute approximate probabilities to answer this question.

Lingo

Lingo is a multistage television game show played by two teams of two individuals (watch the first ten minutes of the video). The main game is played in two stages. Each team is given a Lingo game card with 25 slots arranged in five rows and five columns. The Lingo game card is initialized to have 10 slots randomly filled, constrained so that there are no more than three slots selected in any row, column, or diagonal, and the card is not already a winning card.

In the first stage of play, a five letter word is randomly generated. The first letter is revealed, and the playing team has five opportunities to guess the correct word. If the team guesses the word correctly, they are awarded cash ($25) and two random draws to fill in their Lingo game card.

The second stage of play consists of the two random draws to fill in the Lingo game card. Lingo (like Bingo) is achieved when the Lingo game card has five slots in a row, column, or diagonal (including the reverse diagonal). The team receives a larger cash reward ($50). A new Lingo game card is initialized and play continues. The team with the most cash progresses to final Lingo. The second stage of play is a vital aspect of a team’s success. However, the second stage of play is purely random and can be simulated.

We will consider a relaxed version of the Lingo game card and random draws. We will consider the 25 slots arranged in five rows and five columns. For initialization, we will only require that the 10 randomly filled slots be constrained so that there are two slots selected in each row; and the game card is not initialized as a winning game card. We will assume that the randomly generated draws have no dead balls, prizes, or loss of turn.

We want to use simulation to determine how many Lingo random draws (of two balls) are required before Lingo is achieved on the Lingo game card. Recall that each set of two random draws is obtained by correctly identifying the randomly generated word in the first stage. So, this will give an indication of the number of correctly identified words which are required to obtain Lingo. Your simulation should return the simulated distribution of the number of random draws required to obtain Lingo and the mean of this simulated distribution. Use 5000 simulations.

Homework #9

  1. A professor never dismisses class early and runs over uniformly between 0 and 5 minutes (continuously). What is the exact probability the professor runs over by more than 2 minutes and 15 seconds?

  2. The time required to assemble an electronic component is Normally distributed with a mean of sixteen minutes and a standard deviation of four minutes. Suppose ten components are being assembled. What is the probability that one of the ten components will be assembled in less than ten minutes?

  3. Pick 5 cards from a standard deck of cards without replacement. Simulate the probability of obtaining a full house (three of one face, two of another face). Use 10000 simulations.

  4. 7-11-21 is a scratcher. The ticket has three boxes. Each box contains three numbers which are randomly sampled with replacement from the integers 1-13. For a given box, cash prizes are awarded when the sum of the three numbers is either seven, eleven, or twenty-one. It is possible that there are multiple winning boxes on a single ticket.

    1. Simulate the distribution of the possible sum values and make a histogram of the simulated distribution. On the histogram, indicate the winning sum values. Use 10000 simulations.

    2. Create a table (and barchart) for the distribution of the winning sum values (include a category for no winner).

    3. Create a table (and barchart) for the distribution of the number of winning boxes per ticket.

    4. Suppose the dollar amounts associated with each winning value are $1 for a 21 (a free ticket), $5 for an 11, $10 for a 7, and $0 for anything else. Make a table (and barchart) for the distribution of the win amounts per ticket and determine the average amount won per ticket.

  5. For series game play, the winner is determined by the outcomes of play in a best-of-seven game series. Suppose teams A and B are in the series. Games one and two are played at team A’s stadium; games three, four, and five (if needed) are played at team B’s stadium; and games six (if needed) and seven (if needed) are played at team A’s stadium. Historical evidence of play between teams A and B indicates that \(P\left(\text{A Win} | \text{A Home} \right) = 0.65\) and \(P\left(\text{A Win} | \text{B Home} \right) = 0.55\). Assume the winner of each game is independent of the previous games. Simulate the series play keeping track of how many games are required to determine a winner and who the winner is. Use 10000 simulations.

    1. What is the approximate probability that team A wins the series?

    2. What is the distribution of the number of games required to determine a winner?

Week 9

Distribution of the Sample Average

Here is a fun simulation applet that samples data from a given population and will simulate the distribution of the sample average.

The data are a SRS of numeric (quantitative) data.

  • \(y_{1}, ..., y_{n}\)

  • \(n\) is the sample size or number of observations

  • The observations are independent of one another

  • The data come from any population

    • Mean \(\mu\)

    • Standard deviation \(\sigma\)

The Central Limit Theorem

  • If n is sufficiently large, then the sample mean/average \(\overline{Y}\) has an approximate Normal distribution

  • The mean is \(\mu\)

  • The standard deviation is \(\frac{\sigma}{\sqrt{n}}\)

  • This approximation usually works when \(n \geq 30\)

  • The CLT is about the distribution of the sample average/mean and not the data
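
The bullets above can be illustrated with a short simulation. This sketch samples from a skewed Exponential population with mean 1 and standard deviation 1, and shows the simulated sample averages behaving as the CLT predicts:

```r
# Simulate the distribution of the sample average from a skewed population
set.seed(1);
n = 50;
simNum = 10000;
xBars = replicate(simNum, mean(rexp(n, rate=1)));

mean(xBars);   # close to mu = 1
sd(xBars);     # close to sigma/sqrt(n) = 1/sqrt(50), about 0.141
hist(xBars);   # roughly bell-shaped even though the population is skewed
```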

EDA for a Single Numeric Variable

My brother-in-law works for a local law enforcement agency. For vehicles traveling on Highway 46 east of Paso Robles, he measured their speeds using his radar. The speeds are in the data frame highwaySpeeds.

The data are a SRS of numeric (quantitative) data.

  • \(y_{1},...,y_{n}\)

  • The observations are independent of one another

  • For statistical inference, we have to assume the values have a Normal distribution. If the values don’t, T methods serve as a good approximation when the sample size is at least 30.

Below are the critical components of an EDA.

  1. Numerical statistics: the mean, median, standard deviation, a summary and the sample size

  2. Density Estimate or Histogram: what is the shape of the distribution of the data? Does it look Normal? Is it skewed? Are there multiple peaks or modes?

  3. Boxplot: is the distribution skewed? Are there outliers? If there are, do not remove them for now.

  4. Normality Test: there is a goodness-of-fit test given by the hypotheses below.

    \[\begin{eqnarray*} H_{0} & : & \text{Data comes from a Normal population} \\ H_{1} & : & \text{Data does not come from a Normal population} \end{eqnarray*}\]

    shapiro.test(highwaySpeeds$Speed);

    Note that this function only allows an input vector of between 3 and 5000 observations. If you have more, just take a random sample.

    Another option is to use the function ks.test, but this test doesn’t like ties.

    ks.test((highwaySpeeds$Speed-mean(highwaySpeeds$Speed))/sd(highwaySpeeds$Speed), "pnorm");
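
The numerical and graphical pieces of the EDA above can be sketched as follows. The highwaySpeeds data frame is not included here, so this sketch uses simulated speeds as a stand-in:

```r
# Stand-in for highwaySpeeds$Speed (the real data frame lives in the class workspace)
set.seed(11);
speed = rnorm(100, 67, 5);

# 1. Numerical statistics
mean(speed); median(speed); sd(speed);
summary(speed);
length(speed);   # sample size

# 2-3. Graphical summaries: shape, skewness, outliers
hist(speed);
boxplot(speed);

# 4. Normality test
shapiro.test(speed);
```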

The T Distribution

We assume that the data are sampled from a Normal population.

  • This is one reason for the diligent EDA

  • But, what happens when the data do not come from a Normal population?

  • The Central Limit Theorem: if n is sufficiently large (\(n \geq 30\)), then \(\overline{Y}\) has an approximate Normal distribution with mean \(\mu\) and standard deviation \(\frac{\sigma}{\sqrt{n}}\)

  • Standardize the sample average \(Z = \frac{\overline{Y}-\mu}{\frac{\sigma}{\sqrt{n}}}\)

  • \(Z\) will have an approximate Normal distribution with a mean of 0 and standard deviation 1

  • This gives the z with \(\sigma\) confidence interval for a population mean: \(\overline{y} \pm z \frac{\sigma}{\sqrt{n}}\)

The problem with the statistic and method is that \(\sigma\) is not known.

  • Use the sample standard deviation \(s\) in place of the population standard deviation \(\sigma\)

  • This is the z with s confidence interval for a population mean: \(\overline{y} \pm z \frac{s}{\sqrt{n}}\)

  • The Z method doesn’t account for the extra variation when using s in place of \(\sigma\)

  • The method will not work for small sample sizes

The T distribution/statistic accounts for the extra variation when using \(s\) in place of \(\sigma\).

  • There are two sources of variability in the statistic: \(\overline{Y}\) and \(s\)

  • \(T = \frac{\overline{Y}-\mu}{\frac{s}{\sqrt{n}}}\)

  • It accounts for this variation by allowing for more variation in the tails of the distribution

  • There is more area under the curve in the tails and less area in the center of the distribution

  • Fat or heavy tails

  • The T distribution’s parameter is the degrees of freedom \(df=n-1\)

data = seq(-4, 4, 0.01);

normDataPdf = dnorm(data);
tData1Pdf = dt(data, 1);
tData5Pdf = dt(data, 5);
tData20Pdf = dt(data, 20);

plot(data, normDataPdf, type="l", xlab="", ylab="", main="Z vs. Various T");
points(data, tData1Pdf, type="l", lty=2, col=2);
points(data, tData5Pdf, type="l", lty=3, col=3);
points(data, tData20Pdf, type="l", lty=4, col=4);
legend("topright", c("Z", "T with df=1", "T with df=5", "T with df=20"), 
    lty=c(1,2,3,4), col=c(1,2,3,4));

One-Sample T Confidence Interval

Here is a fun simulation applet that randomly generates confidence intervals for a population mean.

A \(100(1-\alpha)\%\) confidence interval for the population mean is based on T and not Z.

  • \(\overline{y} \pm t\frac{s}{\sqrt{n}}\)

  • \(t=t_{df, \frac{\alpha}{2}}\) is the \(1-\frac{\alpha}{2}\) percentile of \(T_{df}\)

  • \(df=n-1\)

Use the function t.test to obtain a confidence interval.

  • \(x\): data

  • conf.level: confidence level

t.test(highwaySpeeds$Speed);
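
The interval t.test reports is exactly the formula above, \(\overline{y} \pm t\frac{s}{\sqrt{n}}\). Since highwaySpeeds is not available here, this check uses simulated speeds as a stand-in:

```r
# Verify the t.test confidence interval against the formula by hand
set.seed(22);
y = rnorm(40, 67, 5);   # stand-in for highwaySpeeds$Speed
n = length(y);
tCrit = qt(0.975, df=n-1);
byHand = mean(y) + c(-1, 1) * tCrit * sd(y)/sqrt(n);

byHand;
t.test(y)$conf.int;     # same endpoints
```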

Bootstrapped Confidence Interval

Bootstrapping entails sampling the sample with replacement to simulate the distribution of a statistic. It can be used to obtain an approximate confidence interval in situations when the parametric (eg T) methods are not suitable.

  • Sample the original vector with replacement n times

  • Store the value of the resulting sample average

  • Then find the middle 95% of the values

allTheMeans = NULL;
simNum = 1000;
for(i in 1:simNum){
  sampleTemp = sample(highwaySpeeds$Speed, size=length(highwaySpeeds$Speed), replace=TRUE);
  allTheMeans = c(allTheMeans, mean(sampleTemp));
}

hist(allTheMeans);

quantile(allTheMeans, probs=c(0.025, 0.975));

Hypothesis Test for a Single Population Mean

Let \(\mu\) be the unknown population mean and \(\mu_{0}\) be the hypothesized mean.

Two-tailed Right-tailed Left-tailed
\(H_{0}:\mu=\mu_{0}\) \(H_{0}: \mu= (\leq) \mu_{0}\) \(H_{0}: \mu=(\geq) \mu_{0}\)
\(H_{1}:\mu \ne \mu_{0}\) \(H_{1}: \mu > \mu_{0}\) \(H_{1}: \mu < \mu_{0}\)
\(2P(T>|t|)\) \(P(T>t)\) \(P(T<t)\)

Table: Hypothesis test about a population mean

The test statistic uses the hypothesized mean.

  • \(t = \frac{\overline{y}-\mu_{0}}{\frac{s}{\sqrt{n}}}\)

  • No units

  • The number of standard deviations the sample average is away from the hypothesized mean

  • Has a T distribution with \(df=n-1\)

Suppose we want to determine if the mean highway speed is more than 65mph.

\[\begin{eqnarray*} H_{0} & : & \mu = 65 \\ H_{1} & : & \mu > 65 \end{eqnarray*}\]

Use the function t.test to obtain a p-value for a given hypothesis test.

  • \(x\): data

  • alternative: alternative hypothesis

  • mu: hypothesized mean

t.test(highwaySpeeds$Speed, mu=65, alternative="greater");
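
The statistic and p-value that t.test reports can be recomputed directly from \(t = \frac{\overline{y}-\mu_{0}}{s/\sqrt{n}}\); again, simulated speeds stand in since highwaySpeeds is not available here:

```r
# By-hand version of the right-tailed one-sample T test
set.seed(33);
y = rnorm(40, 67, 5);   # stand-in for highwaySpeeds$Speed
tStat = (mean(y) - 65) / (sd(y)/sqrt(length(y)));
pVal = pt(tStat, df=length(y)-1, lower.tail=FALSE);

c(tStat, pVal);
t.test(y, mu=65, alternative="greater");   # same statistic and p-value
```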

Just how extreme is that statistic?

x = sort(rt(1000, df=39));
dx = dt(x, df=39);
plot(x, dx, type="l", col="blue", xlim=c(-3,3));
abline(v=2.8794, col="orange");

Bootstrapped P-value

Consider the hypothesis test about the mean highway speed.

\[\begin{eqnarray*} H_{0} & : & \mu = 65 \\ H_{1} & : & \mu > 65 \end{eqnarray*}\]

We can use bootstrapping to obtain an approximate p-value as well. Before resampling, recenter the values with the below.

\[y^{*} = y - \overline{y} + \mu_{0}\]

Then, resample the \(y^{*}\) values, store the resulting sample averages, and find the proportion of sample averages that are larger than the observed sample average.

yStar = highwaySpeeds$Speed - mean(highwaySpeeds$Speed) + 65;
allTheMeans = NULL;
simNum = 1000;
for(i in 1:simNum){
  sampleTemp = sample(yStar, size=length(yStar), replace=TRUE);
  allTheMeans = c(allTheMeans, mean(sampleTemp));
}

hist(allTheMeans);
abline(v=mean(highwaySpeeds$Speed), col="orange");

sum(allTheMeans > mean(highwaySpeeds$Speed)) / simNum;

One-Sample Activity

  1. Perhaps you have seen this simulation applet? Your goal is to simulate the distribution of the sample average/mean, and reproduce the first three graphs (ie the parent population, last sample, simulated distribution of the sample average/mean) on the same graphics device.

    1. Let Y have an Exponential distribution with rate parameter of 1. Plot the PDF.

    2. Randomly sample 10 observations from the distribution. Find the mean of the sample.

    3. Use a for loop to repeat step b, storing the sample mean in each iteration. Use a simulation size of 1000.

    4. Compute the mean and standard deviation of the simulated sample mean.

    5. On the same graph device, plot the exponential distribution, a histogram of the last sample generated, and the histogram of the simulated sample means.

  2. The data frame coasters in the workspace classData.Rdata contains information on the age of roller coasters (in years). A safety organization is concerned about the age of the roller coasters, in particular roller coasters older than 24 years. Analyze these data using an EDA, confidence interval, and hypothesis test to determine if the average roller coaster age is more than 24 years. Use the T distribution for your statistical inference.

Homework #10

  1. Let X have an Exponential distribution with a mean of 20 and Y have an Exponential distribution with a mean of 10. Using a sample size of 10 from each distribution, simulate the distribution of the ratio of the sample medians, \(m(X)/m(Y)\). Make a histogram of the simulated distribution. Use a simulation size of 1000.

  2. The data frame magSub in the workspace classData.Rdata contains information on individuals from a marketing survey including their income. Conduct an analysis of the incomes including an EDA, confidence interval, and a hypothesis test to determine if the mean income is more than 50k. Use the T distribution for your statistical inference.

EDA of a Numeric Variable and a Categorical Variable

Are the ages of wooden roller coasters more than the ages of steel roller coasters? If so, by how much on average? The data frame coasters contains the age and type of construction for roller coasters.

The data are a SRS of two variables.

  • Response: characteristic of interest (Y)

  • Factor: categorical grouping variable (X)

Factors might affect the response.

  • Like a formula Y~X

  • X is used to predict or explain variation in Y

The assumption/conditions are similar to one variable.

  • The observations come from a Normal population

  • The observations within each group are assumed to be independent of one another

  • The observations between the groups are independent of one another

Below are the components of an EDA.

  1. Numerical statistics: the mean, median and standard deviation for each group. Are the standard deviations similar or very different?

  2. Histograms (or density estimates): what is the shape of the distribution of the data in each group, do they look Normal? Is the variation in the data the same or different between the groups?

  3. Boxplots: are the distributions skewed? Are there outliers in each group? If there are, do not remove them for now. Is the variation in the data the same or different between the groups?

  4. Normality Test: do a test of Normality for the data in each group. You can tapply the function shapiro.test.

  5. Test of Unequal Variances: we can test the below hypothesis to help determine what to assume about the population variances (or standard deviations).

    \[\begin{eqnarray*} H_{0} & : & \sigma_{1}^{2} = \sigma_{2}^{2} \\ H_{1} & : & \sigma_{1}^{2} \ne \sigma_{2}^{2} \end{eqnarray*}\]

    Use the function bartlett.test; the primary argument is a formula, much like boxplot.

    bartlett.test(Age~Type, data=coasters);

Confidence Interval for the Difference of Two Population Means

The population parameters, sample data and statistics are summarized below.

Parameters Sample Statistics
\(\mu_{1}, \sigma_{1}\) \(y_{1,1},...,y_{1,n_{1}}\) \(\overline{y}_{1}, s_{1}, n_{1}\)
\(\mu_{2}, \sigma_{2}\) \(y_{2,1},...,y_{2,n_{2}}\) \(\overline{y}_{2}, s_{2}, n_{2}\)

A \(100(1-\alpha)\%\) confidence interval for \(\mu_{1}-\mu_{2}\) depends on the population variances \(\sigma_{1}^{2}\) and \(\sigma_{2}^{2}\).

  • \(\overline{y}_{1}-\overline{y}_{2} \pm t SE(\overline{y}_{1}-\overline{y}_{2})\)

  • The \(t\) and \(SE\) depend on the population variances

  • This is why the EDA investigating the variation is so important

There are two approaches that should be determined by your EDA.

  • Unequal Variances

    • \(\sigma_{1}^{2} \ne \sigma_{2}^{2}\)

    • The variances are different

    • Each is estimated by their sample variance

    • \(SE(\overline{y}_{1}-\overline{y}_{2})=\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}\)

    • The DF is only an approximation (a horrible formula that R will compute)

  • Equal Variances

    • \(\sigma_{1}^{2}=\sigma_{2}^{2}\)

    • There is only a single population variance \(\sigma^{2}\) to estimate

    • The two estimates are combined or pooled

    • \(s_{p} = \sqrt{\frac{(n_{1}-1)s_{1}^{2}+(n_{2}-1)s_{2}^{2}}{n_{1}+n_{2}-2}}\)

    • \(SE(\overline{y}_{1}-\overline{y}_{2})=\sqrt{\frac{s_{p}^{2}}{n_{1}}+\frac{s_{p}^{2}}{n_{2}}}=s_{p}\sqrt{\frac{1}{n_{1}}+\frac{1}{n_{2}}}\)

    • \(df=n_{1}+n_{2}-2\)

Use the function t.test with a formula input.

  • x: response~factor

  • data: data frame

  • var.equal: TRUE specifies equal variances

  • conf.level: confidence level

t.test(Age~Type, data=coasters, var.equal=FALSE);
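
The pooled (equal-variance) formulas above can be verified by hand. The coasters data frame is not available here, so two simulated groups stand in:

```r
# Pooled two-sample interval built from the s_p and SE formulas
set.seed(44);
y1 = rnorm(30, 40, 8);   # stand-in for one group's ages
y2 = rnorm(25, 55, 8);   # stand-in for the other group's ages
n1 = length(y1); n2 = length(y2);

sp = sqrt(((n1-1)*var(y1) + (n2-1)*var(y2)) / (n1+n2-2));
se = sp * sqrt(1/n1 + 1/n2);
byHand = (mean(y1) - mean(y2)) + c(-1, 1) * qt(0.975, df=n1+n2-2) * se;

byHand;
t.test(y1, y2, var.equal=TRUE)$conf.int;   # same endpoints
```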

To switch the order of differencing, multiply the confidence interval endpoints by -1 (and swap them).

Hypothesis Test for the Difference of Two Population Means

Let \(D_{0}\) be the hypothesized difference.

Two-tailed Right-tailed Left-tailed
\(H_{0}:\mu_{1}-\mu_{2} = D_{0}\) \(H_{0}:\mu_{1}-\mu_{2} = (\leq) D_{0}\) \(H_{0}:\mu_{1}-\mu_{2} = (\geq) D_{0}\)
\(H_{1}:\mu_{1}-\mu_{2} \ne D_{0}\) \(H_{1}:\mu_{1}-\mu_{2} > D_{0}\) \(H_{1}:\mu_{1}-\mu_{2} < D_{0}\)
\(2P(T>|t|)\) \(P(T>t)\) \(P(T<t)\)

Table: Two-sample T-tests

The test statistic typically has \(D_{0}=0\).

  • \(t=\frac{\overline{y}_{1}-\overline{y}_{2}-D_{0}}{SE(\overline{y}_{1}-\overline{y}_{2})}\)

  • The DF and SE depend on the population variances

Suppose you want to determine if the mean age of steel roller coasters is less than the mean age of wooden roller coasters. Let \(\mu_{1}\) be the mean age of steel roller coasters and \(\mu_{2}\) be the mean age of wooden roller coasters.

\[\begin{eqnarray*} H_{0} & : & \mu_{1}-\mu_{2} = 0 \\ H_{1} & : & \mu_{1}-\mu_{2} < 0 \end{eqnarray*}\]

Use the function t.test with a formula input.

  • x: response~factor

  • data: data frame

  • var.equal: TRUE specifies equal variances

  • alternative: alternative hypothesis

  • \(mu\): hypothesized difference

t.test(Age~Type, data=coasters, mu=0, alternative="less", var.equal=FALSE);

Homework #11

  1. The data frame magSub in the workspace classData.Rdata contains information on individuals from a marketing survey including their income and if the individual is a subscriber to a certain magazine. Conduct an analysis of income by subscriber including an EDA, confidence interval, and hypothesis test. Use your inference (based on the T distribution) to determine whether the mean income of subscribers is more than the mean income of nonsubscribers.

  2. Refer to your analysis of income by subscriber. Use bootstrapping to simulate the distribution of the difference between the sample averages. Create a histogram of the simulated distribution and obtain an approximate 95% confidence interval for the difference between the population means. Use 1000 simulations. You are not allowed to use a built-in function to do this.

  3. Perhaps you have seen this simulation applet? Refer to your analysis of income by subscriber. Your goal is to implement a randomization test for the difference of the population means like the applet does.

    Assume the null hypothesis about the means is true and the population variances are equal. Then, we are basically assuming that both sets of incomes come from the same distribution. So, you can combine all of the data, randomly break the data into two groups without replacement (using the two sample sizes from the original data), find the resulting sample averages and the difference between them.

    Make a histogram of the simulated differences between the sample averages, and use the simulated distribution to obtain an approximate p-value for the hypothesis test about the population means. The p-value is the proportion of simulated differences that are more extreme than the observed difference. Use 1000 simulations. Do not alter magSub in any way. You are not allowed to use a built-in function to do this. You might find the function setdiff helpful here.

Week 10

One-Way ANOVA

Suppose a fast food firm has three promotions, and wants to know how the weekly sales are affected by the promotions. The data frame fastFood contains the weekly sales (SalesInThousands, in thousands of dollars) and a variable indicating the promotion (Promotion) for its stores in a single month.

The data are a SRS of two variables.

  • Response: characteristic of interest (Y)

  • Factor: categorical grouping variable (X)

Factors might affect the response.

  • Like a formula Y~X

  • X is used to predict or explain variation in Y

The assumption/conditions are similar to one variable.

  • The observations come from a Normal population

  • The observations within each group are assumed to be independent of one another

  • The observations between the groups are independent of one another

Let \(k\) represent the number of categories, levels or groups. The total sample size is \(N = n_{1} + ... + n_{k}\).

Parameters Sample Statistics
\(\mu_{1}, \sigma\) \(y_{1,1},...,y_{1,n_{1}}\) \(\overline{y}_{1}, s_{1}, n_{1}\)
\(\vdots\) \(\vdots\) \(\vdots\)
\(\mu_{k}, \sigma\) \(y_{k,1},...,y_{k,n_{k}}\) \(\overline{y}_{k}, s_{k}, n_{k}\)

The EDA is identical to the EDA for the difference of two population means with output for each group. The only difference is the general test for unequal variances.

\[\begin{eqnarray*} H_{0} & : & \sigma_{1}^{2} = ... = \sigma_{k}^{2} \\ H_{1} & : & \text{at least two population variances are different} \end{eqnarray*}\]

Use the function bartlett.test; the primary argument is a formula, much like boxplot.

bartlett.test(SalesInThousands~Promotion, data=fastFood);

The general form of the hypothesis test about the means is below.

\[\begin{eqnarray*} H_{0} & : & \mu_{1} = ... = \mu_{k} \\ H_{1} & : & \text{at least two population means are different} \end{eqnarray*}\]

The test statistic compares the between treatments estimate of \(\sigma^{2}\), \(MSTr\), to the within treatments estimate of \(\sigma^{2}\), \(MSE\).

\[f = \frac{MSTr}{MSE}\]

  • Has an F distribution with \(k-1\) numerator degrees of freedom and \(N-k\) denominator degrees of freedom

  • When the null hypothesis is true, the test statistic should be about 1

  • All we care about is how much larger \(MSTr\) is in comparison to \(MSE\)

  • The test is a right-tailed test and the p-value is given by \(P(F_{k-1, N-k}>f)\)

In R, use the function aov to create an ANOVA object using a formula specification like with boxplot.

anovaObj = aov(SalesInThousands~Promotion, data=fastFood);

Then obtain a summary of the object to get an ANOVA table.

summary(anovaObj);
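
The F statistic in the summary is just \(MSTr/MSE\), and its p-value comes from pf. Since fastFood is not available here, three simulated promotion groups stand in:

```r
# Recover the F statistic and p-value from an ANOVA table by hand
set.seed(55);
sales = c(rnorm(20, 50, 10), rnorm(20, 57, 10), rnorm(20, 47, 10));
promo = factor(rep(c("P1", "P2", "P3"), each=20));   # k = 3 groups, N = 60

fit = aov(sales ~ promo);
tab = summary(fit)[[1]];

f = tab[1, "Mean Sq"] / tab[2, "Mean Sq"];    # MSTr / MSE
f;
pf(f, df1=3-1, df2=60-3, lower.tail=FALSE);   # matches Pr(>F) in the table
```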

Use the function TukeyHSD to obtain Tukey-Kramer confidence intervals; the input is the aov object you created previously.

TukeyHSD(anovaObj);

You can also make a plot of the pairwise confidence intervals.

plot(TukeyHSD(anovaObj));

You can get a connecting letters report. The package agricolae can be a bit fussy with the install.

library(agricolae);
hsdOut = HSD.test(anovaObj, "Promotion", group=TRUE);
hsdOut;

Homework #12

  1. The data frame magSub in the workspace classData.Rdata contains information on individuals from a marketing survey including their income and the marketing segment they belong to. Conduct an analysis of the income by marketing segment including an EDA, hypothesis test, and if needed a Tukey multiple comparisons.